I stumbled upon this article on the new Nvidia RTX GPUs

corres · Post by **corres** » Tue Nov 10, 2020 7:39 pm

I think that if somebody uses Leela for position analysis, the "classical speed test" gives more realistic results than Leela's built-in benchmark, which is good mainly for short-time engine-engine matches.
The "classical speed test" is very simple:
Click on lc0.exe, type go nodes 10000000 in its window, and watch the running parameter list.
When Leela displays the best move, search the parameter list for the highest nps value and the
depth, time and node count that belong to it. It is important to know the name/mark of the net used, because it determines the speed obtained. Other parameters (type of GPU, GPU clock speed) are also important to report.
Sometimes 10000000 nodes is too few, so you need to type: go nodes 15000000 -or- go nodes 20000000.
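
The manual search for the highest nps line can also be scripted. Below is a minimal Python sketch, assuming the engine output has been copied into a string and follows the usual UCI "info ... nps ..." layout; the function name peak_nps and the SAMPLE text are illustrative only (the two sample lines are in the format lc0 prints, with the pv shortened).

```python
def peak_nps(uci_output):
    """Return depth, time (ms), nodes and nps from the info line with the highest nps."""
    wanted = ("depth", "time", "nodes", "nps")
    best = None
    for line in uci_output.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "info":
            continue
        # Ignore everything from "pv" onwards so move text is never read as a key.
        if "pv" in tokens:
            tokens = tokens[:tokens.index("pv")]
        try:
            fields = {key: int(tokens[tokens.index(key) + 1]) for key in wanted}
        except (ValueError, IndexError):
            continue  # the line lacks one of the fields, e.g. "info string ..."
        if best is None or fields["nps"] > best["nps"]:
            best = fields
    return best


# Illustrative sample in UCI info format (pv shortened).
SAMPLE = """\
info depth 1 seldepth 2 time 1671 nodes 6 score cp 11 nps 857 tbhits 0 pv d2d4 g8f6
info depth 10 seldepth 33 time 19867 nodes 441705 score cp 12 nps 24266 tbhits 0 pv d2d4 g8f6
"""

print(peak_nps(SAMPLE))
```

The returned dictionary holds the four numbers worth reporting; the net name, GPU type and GPU clock still have to be added by hand.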

corres · Post by **corres** » Tue Nov 10, 2020 8:41 pm

As a demonstration I ran a classical max speed test with my RTX 2080 Ti (Gigabyte, Turbo).
GPU clock = 1620 MHz (you can read it in Leela's running parameter list). I also used jHorthos J92-190 net.
Test 1
LC0 version 0.26.3, Backend type = cudnn, all other Leela parameters are DEFAULT!
Max. speed = 12.8 Kn/sec, Depth = 10, Time = 32 sec, Nodes = 380 Kn
Test 2
LC0 version 0.26.3, Backend type = cuda, all other Leela parameters are DEFAULT!
Max. speed = 19.3 Kn/sec, Depth = 10, Time = 23 sec, Nodes = 414 Kn
And what is the max. speed of your RTX 3080 with the above parameters?

MMarco · Post by **MMarco** » Wed Nov 11, 2020 11:49 am

Laskos wrote:
MMarco wrote: I saw that Nvidia released new drivers today, with improvements to the auto-tuning feature. I'll report back once I have tried them.
Thanks, please keep us updated here.

The new drivers did add a little bit. Nvidia auto-tuning now says +118 MHz.

corres wrote: ↑Tue Nov 10, 2020 8:41 pm As a demonstration I ran a classical max speed test with my RTX 2080 Ti (Gigabyte, Turbo).
GPU clock = 1620 MHz (you can read it in Leela's running parameter list). I also used jHorthos J92-190 net.
...
Test 2
LC0 version 0.26.3, Backend type = cuda, all other Leela parameters are DEFAULT!
Max. speed = 19.3 Kn/sec, Depth = 10, Time = 23 sec, Nodes = 414 Kn
And what is the max. speed of your RTX 3080 with the above parameters?

With all default parameters, after the driver update and tuning, the max speed is 24,266 nps, reached 18.196 s into the search on the starting position, since the time count starts at 1.671 sec.


go nodes 5000000
Loading weights file from: J92-190
Creating backend [cuda-fp16]...
CUDA Runtime version: 11.1.0
Latest version of CUDA supported by the driver: 11.1.0
GPU: GeForce RTX 3080
GPU memory: 10 Gb
GPU clock frequency: 1710 MHz
GPU compute capability: 8.6
info depth 1 seldepth 2 time 1671 nodes 6 score cp 11 nps 857 tbhits 0 pv d2d4 g8f6
...
...
info depth 10 seldepth 33 time 19867 nodes 441705 score cp 12 nps 24266 tbhits 0 pv d2d4 g8f6 c2c4 e7e6 g2g3 f8b4 c1d2 b4e7 d1c2 c7c6 g1f3 d7d5 f1g2 e8g8 d2f4 b7b6 e1g1 b8d7

MMarco wrote:When I benchmark with mbs=204 (10s, all positions, J92-190) I get 24.9 knps. However with mbs=544 (twice the number of tensor cores of the card) I now get 29.8 knps

Here are benchmarks (34 positions) with all parameters at default except the batch size:

204 all-positions, 10s
===========================
Total time (ms) : 340511
Nodes searched : 8564083
Nodes/second : 25151

256 all-positions, 10s
===========================
Total time (ms) : 340630
Nodes searched : 9072112
Nodes/second : 26633

544 all-positions, 10s
===========================
Total time (ms) : 341053
Nodes searched : 10248086
Nodes/second : 30048

I ran two more benchmarks at 18 sec per position (the time at which max speed is reached on the starting position; I think the nncache gets full then), and batchsize=544 was again considerably faster (about 17%):

256 all-positions, 18s
===========================
Total time (ms) : 612619
Nodes searched : 17649352
Nodes/second : 28810

544 all-positions, 18s
===========================
Total time (ms) : 613033
Nodes searched : 20596376
Nodes/second : 33597
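
The percentage quoted for the 18 s runs can be rechecked from the raw totals. A small Python sketch of that arithmetic follows; the helper names are made up for illustration, and the figures are the Total time and Nodes searched values from the two 18 s listings above.

```python
def nodes_per_second(total_time_ms, nodes):
    """Nodes per second from a benchmark's total time (ms) and node count."""
    return nodes / (total_time_ms / 1000.0)


def speedup_percent(baseline_nps, new_nps):
    """Relative gain of new_nps over baseline_nps, in percent."""
    return (new_nps / baseline_nps - 1.0) * 100.0


# batch size -> (total time in ms, nodes searched), from the 18 s benchmarks
runs_18s = {
    256: (612619, 17649352),
    544: (613033, 20596376),
}

nps = {mbs: nodes_per_second(t, n) for mbs, (t, n) in runs_18s.items()}
gain = speedup_percent(nps[256], nps[544])

for mbs in sorted(nps):
    print(f"batch size {mbs}: {nps[mbs]:.0f} nps")
print(f"gain of 544 over 256: {gain:.1f}%")
```

The same two helpers work for the 10 s runs if the tuples are swapped for those totals.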

corres · Post by **corres** » Wed Nov 11, 2020 6:25 pm

The opinion of the Leela developers is that a larger MinibatchSize increases speed but may decrease Elo strength.
Earlier TCEC used MinibatchSize = 512 or 644, but now they (mainly) use the default value of 256. I think it is the best value, because GPU heating also depends on MinibatchSize.
Your results show about a +20 % advantage for the RTX 3080, so I think the price difference is the decisive factor.
At present I use RTX 2080 Ti + 2 x RTX 2060 together with backend = multiplexing and BackendOptions = cuda-fp16, which gives max. power at threads = 6 (!). The NNCacheSize was increased to 20000000; all other parameters are default.
Maybe I will exchange the 2 x RTX 2060 for one RTX 3080, because their power consumption is nearly the same (~350 Watts).

corres · Post by **corres** » Thu Nov 12, 2020 3:14 pm

For the sake of interest I also ran the speed test on my DUAL and my TRIAD GPU setups.
Test 1
My DUAL 2 x RTX 2060 OC, GPU clock speed 1770 MHz, test method and parameters as above, and
Threads = 4, Backend = Multiplexing and BackendOptions = (backend=cuda-fp16,gpu=1),
(backend=cuda-fp16,gpu=2)
Max. speed = 21.2 Kn/sec, Depth = 14, Time = 268 sec, Nodes = 5.1 Mn
Test 2
My TRIAD 1 x RTX 2080 Ti + 2 x RTX 2060 OC
Threads = 6, Backend = Multiplexing and BackendOptions = (backend=cuda-fp16,gpu=0),
(backend=cuda-fp16,gpu=1),(backend=cuda-fp16,gpu=2)
Max. speed = 44.0 Kn/sec, Depth = 14, Time = 133 sec, Nodes = 5.7 Mn
So the speed of one RTX 2080 Ti is about the speed of 2 x RTX 2060,
and, at least in the case of big nets (like J92-190), the speeds of the individual GPUs nearly add up.
In the case of smaller nets (= higher speeds) and a larger number of GPUs we should expect some loss in speed.
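
For anyone who wants to reproduce such a multi-GPU setup from a script, here is a minimal Python sketch that assembles the BackendOptions string in the parenthesised form shown above. The function names are hypothetical, and the command-line spellings (--backend, --backend-opts, --threads) are assumed to match your Lc0 build; please check them against its help output before relying on them.

```python
def multiplexing_opts(gpu_ids, backend="cuda-fp16"):
    """Build one '(backend=...,gpu=N)' group per GPU, joined by commas."""
    return ",".join(f"(backend={backend},gpu={gpu})" for gpu in gpu_ids)


def lc0_args(gpu_ids, threads, exe="lc0"):
    """Argument list for a multiplexing run (flag spellings are assumptions)."""
    return [
        exe,
        "--backend=multiplexing",
        f"--backend-opts={multiplexing_opts(gpu_ids)}",
        f"--threads={threads}",
    ]


# TRIAD setup from the post: GPUs 0, 1 and 2 with six threads
print(multiplexing_opts([0, 1, 2]))
print(lc0_args([0, 1, 2], threads=6))
```

Returning a list rather than one long string means it can be passed to subprocess.run without any shell quoting of the parentheses.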

MMarco · Post by **MMarco** » Sat Nov 21, 2020 12:14 pm

Strong numbers for the 3070 posted on the Leela Discord:

Call on 84 wrote:I recently overclocked my Ryzen 5 3600X and my RTX 3070, so I went ahead and did the Lc0 Benchmarks. I used the J94-80 net.

Total time (ms) : 340869
Nodes searched : 7124418
Nodes/second : 20901

I overclocked my Ryzen 5 3600X to 4.5 GHz on all cores at 1.35V and my RTX 3070 was overclocked to 2.1 GHz Core Clock and 2.05 GHz Memory Clock at 1.0V.

Laskos · Post by **Laskos** » Sat Nov 21, 2020 1:20 pm

MMarco wrote: ↑Sat Nov 21, 2020 12:14 pm Strong numbers for the 3070 posted on the Leela Discord:
Call on 84 wrote:I recently overclocked my Ryzen 5 3600X and my RTX 3070, so I went ahead and did the Lc0 Benchmarks. I used the J94-80 net.

Total time (ms) : 340869
Nodes searched : 7124418
Nodes/second : 20901

I overclocked my Ryzen 5 3600X to 4.5 GHz on all cores at 1.35V and my RTX 3070 was overclocked to 2.1 GHz Core Clock and 2.05 GHz Memory Clock at 1.0V.

Good. Two weeks ago "felix312" used a non-OC 3070 running at about 1900 MHz and got 16800 nps in the benchmark. Maybe with up-to-date drivers, a moderate overclock and a fast CPU, 19k - 20k nps is the speed for the 3070, compared to 11 - 12k nps for the 2070 in similar conditions. It does seem that the 3070 runs at high frequencies, above 2000 MHz, when OC-ed. Also, the 3070 will improve over time, as NVidia is just releasing their drivers and CUDA versions for the 30xx series. Still, most in Europe cannot find a reasonable 3070 for less than $800-850, which is very overpriced. The 3060 Ti is also on the horizon, with good benchmarks (https://www.tomsguide.com/news/nvidia-r ... be-worried), so for me it will be better to wait until at least January, when the prices will be better.

Laskos · Post by **Laskos** » Sat Nov 21, 2020 8:43 pm

corres wrote: ↑Thu Nov 12, 2020 3:14 pm For the sake of interest I also ran the speed test on my DUAL and my TRIAD GPU setups.
Test 1
My DUAL 2 x RTX 2060 OC, GPU clock speed 1770 MHz, test method and parameters as above, and
Threads = 4, Backend = Multiplexing and BackendOptions = (backend=cuda-fp16,gpu=1),
(backend=cuda-fp16,gpu=2)
Max. speed = 21.2 Kn/sec, Depth = 14, Time = 268 sec, Nodes = 5.1 Mn
Test 2
My TRIAD 1 x RTX 2080 Ti + 2 x RTX 2060 OC
Threads = 6, Backend = Multiplexing and BackendOptions = (backend=cuda-fp16,gpu=0),
(backend=cuda-fp16,gpu=1),(backend=cuda-fp16,gpu=2)
Max. speed = 44.0 Kn/sec, Depth = 14, Time = 133 sec, Nodes = 5.7 Mn
So the speed of one RTX 2080 Ti is about the speed of 2 x RTX 2060,
and, at least in the case of big nets (like J92-190), the speeds of the individual GPUs nearly add up.
In the case of smaller nets (= higher speeds) and a larger number of GPUs we should expect some loss in speed.

With NNCache set to 20 million, the cuda-fp16 backend, net J92-190 and the rest default, my 2070 OC-ed (by 138 MHz using the tuner) reaches a speed of 13.7 kn/s at Nodes = 5.6 Mn.
With

MMarco · Post by **MMarco** » Fri Feb 26, 2021 6:21 pm

Here are some bench results with the latest v0.27.0 Lc0 binary. There is a substantial NPS increase, especially with these extra parameters:
--max-collision-events=917
--max-collision-visits=1000
--max-out-of-order-evals-factor=2.4

They were suggested on the Lc0 Discord, and I found them to be an Elo gainer (about +5 Elo) in various tests. With 10x128 nets, adding --multi-gather=true on top will likely give a dramatic boost in NPS. The benchmarks were done on my 3070 after running the Nvidia autotuner.


**J94-100**
./lc0.exe benchmark --nncache=1000000
===========================
Total time (ms) : 340832
Nodes searched  : 6504039
Nodes/second    : 19083

**J94-100**
./lc0.exe benchmark --nncache=1000000 --max-collision-events=917 --max-collision-visits=1000 --max-out-of-order-evals-factor=2.4
===========================
Total time (ms) : 341207
Nodes searched  : 7518421
Nodes/second    : 22035

**LS-15**
./lc0.exe benchmark --nncache=2000000 --max-collision-events=917 --max-collision-visits=1000 --max-out-of-order-evals-factor=2.4
===========================
Total time (ms) : 340506
Nodes searched  : 19328043
Nodes/second    : 56763

**703810**
./lc0.exe benchmark --nncache=5000000 --multi-gather=true --max-collision-events=917 --max-collision-visits=1000 --max-out-of-order-evals-factor=2.4
===========================
Total time (ms) : 340136
Nodes searched  : 86976845
Nodes/second    : 255711
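
To compare runs like these without retyping numbers, the summary blocks can be parsed directly. The following Python sketch assumes the three-line summary layout shown above; parse_nps and gain_percent are illustrative names, and REPORT holds the two J94-100 summaries (default parameters first, then with the extra collision parameters).

```python
import re


def parse_nps(benchmark_text):
    """Extract every 'Nodes/second : N' value, in order of appearance."""
    return [int(v) for v in re.findall(r"Nodes/second\s*:\s*(\d+)", benchmark_text)]


def gain_percent(baseline, tuned):
    """Relative gain of tuned over baseline, in percent."""
    return (tuned / baseline - 1.0) * 100.0


# The two J94-100 summaries from the listing above.
REPORT = """\
Total time (ms) : 340832
Nodes searched  : 6504039
Nodes/second    : 19083

Total time (ms) : 341207
Nodes searched  : 7518421
Nodes/second    : 22035
"""

baseline, tuned = parse_nps(REPORT)
print(f"{baseline} -> {tuned} nps ({gain_percent(baseline, tuned):.1f}% gain)")
```

For J94-100 this works out to a gain of roughly 15% from the extra parameters alone. The other two listings use different nets and cache sizes, so they are not directly comparable to this pair.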

MMarco · Post by **MMarco** » Sun Jan 16, 2022 8:41 am

A quick update with Lc0 0.28.0.
Same machine: 3070 + 10700, both at stock.


./lc0.exe benchmark --backend-opts=multi_stream=true --nncache=5000000 --movetime=10000

**J94-100**
===========================
Total time (ms) : 340717
Nodes searched  : 8167468
Nodes/second    : 23971

**LS-15**
===========================
Total time (ms) : 340365
Nodes searched  : 20646975
Nodes/second    : 60661

**703810**
===========================
Total time (ms) : 340118
Nodes searched  : 81025217
Nodes/second    : 238226
