## I stumbled upon this article on the new Nvidia RTX GPUs

**Moderators:** hgm, Dann Corbit, Harvey Williamson


### Re: I stumbled upon this article on the new Nvidia RTX GPUs

I think that for someone using Leela for position analysis, the "classical speed test" gives more realistic results than Leela's built-in benchmark, which is good mainly for short engine-engine matches.

The "classical speed test" is very simple:

Start lc0.exe and in its console window type `go nodes 10000000`, then watch the stream of search output.

When Leela displays the best move, look through the output for the highest nps value and note the depth, time, and node count on that line. It is important to report the name/version of the net used, because it determines the speed obtained. Other parameters (GPU type, GPU clock speed) are also important to report.

Sometimes 10000000 nodes is too few, so you may need to type `go nodes 15000000` or `go nodes 20000000`.
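The manual scan of the output can be automated. Below is a minimal sketch (the function name `max_nps` and the sample lines are my own, for illustration) that parses standard UCI `info` lines and returns the entry with the highest nps, together with its depth, time, and node count:

```python
import re

def max_nps(uci_output):
    """Scan UCI 'info' lines and return the entry with the highest nps.

    Returns a dict with depth, time (ms), nodes, and nps, or None if
    no info line carries an nps field.
    """
    best = None
    for line in uci_output.splitlines():
        m = re.search(
            r"depth (\d+).*?time (\d+).*?nodes (\d+).*?nps (\d+)", line)
        if m:
            depth, time_ms, nodes, nps = map(int, m.groups())
            if best is None or nps > best["nps"]:
                best = {"depth": depth, "time": time_ms,
                        "nodes": nodes, "nps": nps}
    return best

# Two abbreviated info lines from a typical run, for illustration:
sample = (
    "info depth 1 seldepth 2 time 1671 nodes 6 score cp 11 nps 857 tbhits 0\n"
    "info depth 10 seldepth 33 time 19867 nodes 441705 score cp 12 nps 24266 tbhits 0\n"
    "bestmove d2d4\n"
)
print(max_nps(sample))
```

Redirect lc0's console output to a file and feed its contents to this function; the highest-nps line is what the test above asks you to report.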


### Re: I stumbled upon this article on the new Nvidia RTX GPUs

For demonstration, I ran a classical max speed test with my RTX 2080 Ti (Gigabyte, Turbo).

GPU clock = 1620 MHz (you can read it in Leela's running output). I used jHorthos' J92-190 net.

Test 1

Lc0 version 0.26.3, backend = cudnn, all other Leela parameters at DEFAULT!

max. speed = 12.8 kn/s, depth = 10, time = 32 s, nodes = 380 kn

Test 2

Lc0 version 0.26.3, backend = cuda, all other Leela parameters at DEFAULT!

max. speed = 19.3 kn/s, depth = 10, time = 23 s, nodes = 414 kn

And what is the max. speed of your RTX 3080 with the above parameters?


### Re: I stumbled upon this article on the new Nvidia RTX GPUs

> MMarco wrote: I saw that Nvidia released new drivers today, with improvements to the auto-tuning feature. I'll report back when I've tried them.
>
> Laskos wrote: Thanks, please keep us updated here.

The new drivers did add a little bit. Nvidia auto-tuning now says +118 MHz.

> corres wrote: ↑Tue Nov 10, 2020 7:41 pm
>
> For demonstration, I ran a classical max speed test with my RTX 2080 Ti (Gigabyte, Turbo).
>
> GPU clock = 1620 MHz (you can read it in Leela's running output). I used jHorthos' J92-190 net.
>
> ...
>
> Test 2
>
> Lc0 version 0.26.3, backend = cuda, all other Leela parameters at DEFAULT!
>
> max. speed = 19.3 kn/s, depth = 10, time = 23 s, nodes = 414 kn
>
> And what is the max. speed of your RTX 3080 with the above parameters?

With all default parameters after the driver update and tuning, max speed is 24 266 nps after 18.196 s on the starting position (the time count starts at 1.671 s).


```
go nodes 5000000
Loading weights file from: J92-190
Creating backend [cuda-fp16]...
CUDA Runtime version: 11.1.0
Latest version of CUDA supported by the driver: 11.1.0
GPU: GeForce RTX 3080
GPU memory: 10 Gb
GPU clock frequency: 1710 MHz
GPU compute capability: 8.6
info depth 1 seldepth 2 time 1671 nodes 6 score cp 11 nps 857 tbhits 0 pv d2d4 g8f6
...
...
info depth 10 seldepth 33 time 19867 nodes 441705 score cp 12 nps 24266 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 d1c2 c7c6 g1f3 d7d5 f1g2 e8g8 d2f4 b7b6 e1g1 b8d7
```

> MMarco wrote: When I benchmark with mbs=204 (10 s, all positions, J92-190) I get 24.9 knps. However, with mbs=544 (twice the number of tensor cores of the card) I now get 29.8 knps.

Here are benchmarks (34 positions) with all default parameters (except the batch size):

| Batch size | Positions | Time/pos | Total time (ms) | Nodes searched | Nodes/second |
|---|---|---|---|---|---|
| 204 | all | 10 s | 340511 | 8564083 | 25151 |
| 256 | all | 10 s | 340630 | 9072112 | 26633 |
| 544 | all | 10 s | 341053 | 10248086 | 30048 |

I ran two further benchmarks at 18 s per position (the time at which max speed is reached on the starting position; I think the nncache gets full by then), again with batchsize=544 clearly faster (17%):

| Batch size | Positions | Time/pos | Total time (ms) | Nodes searched | Nodes/second |
|---|---|---|---|---|---|
| 256 | all | 18 s | 612619 | 17649352 | 28810 |
| 544 | all | 18 s | 613033 | 20596376 | 33597 |
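The quoted 17% figure can be checked directly from the raw numbers; a quick sketch (all figures copied from the 18 s runs above):

```python
def nps(nodes, total_ms):
    """Average nodes per second from total nodes and total time in ms."""
    return nodes * 1000 / total_ms

nps_256 = nps(17_649_352, 612_619)   # batchsize 256, 18 s per position
nps_544 = nps(20_596_376, 613_033)   # batchsize 544, 18 s per position
gain = nps_544 / nps_256 - 1
print(f"{nps_256:.0f} -> {nps_544:.0f} nps, +{gain:.0%}")
```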

### Re: I stumbled upon this article on the new Nvidia RTX GPUs

The opinion of the Leela developers is that a larger MinibatchSize increases speed but may decrease Elo strength.

Earlier, TCEC used MinibatchSize = 512 or 644, but now they (mainly) use the default value of 256. I think it is the best value, because GPU heating also depends on MinibatchSize.

Your results show about a +20 % advantage for the RTX 3080, so I think the price difference is the decisive factor.

At present I use an RTX 2080 Ti + 2 x RTX 2060 together, with backend = multiplexing and BackendOptions = cuda-fp16, which gives maximum power at threads = 6 (!). NNCacheSize was raised to 20000000; everything else is default.

Maybe I will exchange the 2 x RTX 2060 for one RTX 3080, because their power consumption is about the same (~350 W).


### Re: I stumbled upon this article on the new Nvidia RTX GPUs

For the sake of interest I also ran speed tests on my DUAL and my TRIAD GPU setups.

Test 1

My DUAL 2 x RTX 2060 OC, GPU clock speed 1770 MHz; test method and parameters as above, with Threads = 4, Backend = multiplexing and BackendOptions = (backend=cuda-fp16,gpu=1),(backend=cuda-fp16,gpu=2)

Max. speed = 21.2 kn/s, depth = 14, time = 268 s, nodes = 5.1 Mn

Test 2

My TRIAD 1 x RTX 2080 Ti + 2 x RTX 2060 OC, with Threads = 6, Backend = multiplexing and BackendOptions = (backend=cuda-fp16,gpu=0),(backend=cuda-fp16,gpu=1),(backend=cuda-fp16,gpu=2)

Max. speed = 44.0 kn/s, depth = 14, time = 133 s, nodes = 5.7 Mn

So the speed of one RTX 2080 Ti is about the speed of 2 x RTX 2060, and, at least with big nets (like J92-190), the speeds of the individual GPUs nearly add up.

With a smaller net (i.e. higher speeds) and a larger number of GPUs we should expect some loss in speed.
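The "speeds nearly add up" observation can be sanity-checked from the two reported figures; a rough estimate (numbers copied from the tests above; the subtraction assumes perfectly additive scaling):

```python
# Reported max speeds (kn/s) from the multiplexing tests above.
dual_2060 = 21.2    # 2 x RTX 2060 OC
triad = 44.0        # RTX 2080 Ti + 2 x RTX 2060 OC

# Implied fp16 speed of the 2080 Ti alone, if scaling were perfectly additive.
implied_2080ti = triad - dual_2060
print(f"implied RTX 2080 Ti speed: {implied_2080ti:.1f} kn/s")
```

The implied ~23 kn/s for the 2080 Ti is indeed close to the 21.2 kn/s of the 2060 pair, consistent with the conclusion above.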


### Re: I stumbled upon this article on the new Nvidia RTX GPUs

Strong numbers for the 3070 posted on the Leela discord:

> Call on 84 wrote: I recently overclocked my Ryzen 5 3600X and my RTX 3070, so I went ahead and ran the Lc0 benchmarks. I used the J94-80 net.
>
> Total time (ms) : 340869
>
> Nodes searched : 7124418
>
> Nodes/second : 20901
>
> I overclocked my Ryzen 5 3600X to 4.5 GHz on all cores at 1.35 V, and my RTX 3070 was overclocked to a 2.1 GHz core clock and 2.05 GHz memory clock at 1.0 V.

### Re: I stumbled upon this article on the new Nvidia RTX GPUs

> MMarco wrote: ↑Sat Nov 21, 2020 11:14 am
>
> Strong numbers for the 3070 posted on the Leela discord:
>
> Call on 84 wrote: I recently overclocked my Ryzen 5 3600X and my RTX 3070, so I went ahead and ran the Lc0 benchmarks. I used the J94-80 net.
>
> Total time (ms) : 340869
>
> Nodes searched : 7124418
>
> Nodes/second : 20901
>
> I overclocked my Ryzen 5 3600X to 4.5 GHz on all cores at 1.35 V, and my RTX 3070 was overclocked to a 2.1 GHz core clock and 2.05 GHz memory clock at 1.0 V.

Good. Two weeks ago "felix312" used a non-OC 3070 running at about 1900 MHz and got a 16800 nps benchmark. Maybe with up-to-date drivers, a moderate overclock, and a fast CPU, 19k-20k nps is the speed to expect from a 3070, compared to 11-12k nps from a 2070 in similar conditions. The 3070 does seem to run at high frequencies, above 2000 MHz, when OC-ed. Also, the 3070 will improve over time, as Nvidia is only just releasing its drivers and CUDA versions for the 30xx series. Still, most people in Europe cannot find a reasonable 3070 for less than $800-850, which is very overpriced. The 3060 Ti is also on the horizon, with good benchmarks (https://www.tomsguide.com/news/nvidia-r ... be-worried), so for me it will be better to wait until at least January, when prices will be better.

### Re: I stumbled upon this article on the new Nvidia RTX GPUs

> corres wrote: ↑Thu Nov 12, 2020 2:14 pm
>
> For the sake of interest I also ran speed tests on my DUAL and my TRIAD GPU setups.
>
> Test 1
>
> My DUAL 2 x RTX 2060 OC, GPU clock speed 1770 MHz; test method and parameters as above, with Threads = 4, Backend = multiplexing and BackendOptions = (backend=cuda-fp16,gpu=1),(backend=cuda-fp16,gpu=2)
>
> Max. speed = 21.2 kn/s, depth = 14, time = 268 s, nodes = 5.1 Mn
>
> Test 2
>
> My TRIAD 1 x RTX 2080 Ti + 2 x RTX 2060 OC, with Threads = 6, Backend = multiplexing and BackendOptions = (backend=cuda-fp16,gpu=0),(backend=cuda-fp16,gpu=1),(backend=cuda-fp16,gpu=2)
>
> Max. speed = 44.0 kn/s, depth = 14, time = 133 s, nodes = 5.7 Mn
>
> So the speed of one RTX 2080 Ti is about the speed of 2 x RTX 2060, and, at least with big nets (like J92-190), the speeds of the individual GPUs nearly add up.
>
> With a smaller net (i.e. higher speeds) and a larger number of GPUs we should expect some loss in speed.

With NNCache set to 20 million, the cuda-fp16 backend, net J92-190, the rest default, and the 2070 OC-ed (by 138 MHz using the tuner), I reach nodes = 5.6 Mn at a speed of 13.7 kn/s.

### Re: I stumbled upon this article on the new Nvidia RTX GPUs

Here are some bench results with the latest v0.27.0 Lc0 binary. There is a substantial NPS increase, especially with the extra parameters:

--max-collision-events=917

--max-collision-visits=1000

--max-out-of-order-evals-factor=2.4

They were suggested on the Lc0 discord, and I found them to be an Elo gainer (about +5 Elo) in various tests. With 10x128 nets, adding --multi-gather=true on top will likely give a dramatic boost in NPS. The benches were run on my 3070 after running the Nvidia autotuner.



```
**J94-100**
./lc0.exe benchmark --nncache=1000000
===========================
Total time (ms) : 340832
Nodes searched : 6504039
Nodes/second : 19083
**J94-100**
./lc0.exe benchmark --nncache=1000000 --max-collision-events=917 --max-collision-visits=1000 --max-out-of-order-evals-factor=2.4
===========================
Total time (ms) : 341207
Nodes searched : 7518421
Nodes/second : 22035
**LS-15**
./lc0.exe benchmark --nncache=2000000 --max-collision-events=917 --max-collision-visits=1000 --max-out-of-order-evals-factor=2.4
===========================
Total time (ms) : 340506
Nodes searched : 19328043
Nodes/second : 56763
**703810**
./lc0.exe benchmark --nncache=5000000 --multi-gather=true --max-collision-events=917 --max-collision-visits=1000 --max-out-of-order-evals-factor=2.4
===========================
Total time (ms) : 340136
Nodes searched : 86976845
Nodes/second : 255711
```
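The effect of the extra collision parameters can be read off the two J94-100 runs; a quick check (numbers copied from the benchmark output above):

```python
base_nps = 19083    # J94-100, --nncache only
tuned_nps = 22035   # J94-100, with the three collision parameters added
gain = tuned_nps / base_nps - 1
print(f"+{gain:.1%} nps from the extra parameters")
```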