My non-OC RTX 2070 is very fast with Lc0

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos »

Albert Silver wrote: Mon Dec 24, 2018 4:51 am
lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
Strange.

With that:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.txt.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000

I get:
info depth 15 seldepth 47 time 486903 nodes 17279553 score cp 24 hashfull 682 nps 35488


With that:

lc0-v20rc2.exe --cpuct=3.4 --backend=cudnn-fp16 --minibatch-size=512 --weights=11250.txt.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000

I get an almost identical:
info depth 16 seldepth 48 time 573761 nodes 20451954 score cp 24 hashfull 798 nps 35645


Both are pretty high for this net and my RTX 2070, but "roundrobin" seems to have no effect.
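
For reference, the nps field in these info lines appears to be simply nodes divided by elapsed time (UCI reports time in milliseconds). Below is a minimal Python sketch that re-derives it from the two lines above and compares the runs. The helper names parse_info and nps_from are made up for illustration and are not part of Lc0.

```python
def parse_info(line):
    """Collect the integer fields of a UCI 'info' line into a dict."""
    wanted = ("depth", "seldepth", "time", "nodes", "hashfull", "nps")
    tokens = line.split()
    fields = {}
    # Walk over (keyword, value) pairs; keywords we do not want are skipped.
    for key, value in zip(tokens, tokens[1:]):
        if key in wanted:
            fields[key] = int(value)
    return fields


def nps_from(fields):
    """Nodes per second from nodes and time (milliseconds), rounded down."""
    return fields["nodes"] * 1000 // fields["time"]


roundrobin = parse_info(
    "info depth 15 seldepth 47 time 486903 nodes 17279553 "
    "score cp 24 hashfull 682 nps 35488"
)
plain = parse_info(
    "info depth 16 seldepth 48 time 573761 nodes 20451954 "
    "score cp 24 hashfull 798 nps 35645"
)

# Relative speed difference between the two backends.
difference = (nps_from(plain) - nps_from(roundrobin)) / nps_from(roundrobin)

print(nps_from(roundrobin), nps_from(plain))
print(f"plain cudnn-fp16 vs roundrobin: {difference:+.2%}")
```

The recomputed values match the reported nps fields, and the two runs differ by well under 1%, which is within normal run-to-run noise.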
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos »

Laskos wrote: Mon Dec 24, 2018 6:06 am
Albert Silver wrote: Mon Dec 24, 2018 4:51 am
lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
Strange.

With that:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.txt.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000

I get:
info depth 15 seldepth 47 time 486903 nodes 17279553 score cp 24 hashfull 682 nps 35488


With that:

lc0-v20rc2.exe --cpuct=3.4 --backend=cudnn-fp16 --minibatch-size=512 --weights=11250.txt.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000

I get an almost identical:
info depth 16 seldepth 48 time 573761 nodes 20451954 score cp 24 hashfull 798 nps 35645


Both are pretty high for this net and my RTX 2070, but "roundrobin" seems to have no effect.
Test30 nets are much faster:

lc0-v20rc2.exe --cpuct=3.4 --backend=cudnn-fp16 --minibatch-size=512 --weights=weights_run2_32112.pb.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000


info depth 18 seldepth 49 time 297643 nodes 14395400 score cp 46 hashfull 379 nps 48364
crem
Posts: 177
Joined: Wed May 23, 2018 9:29 pm

Re: My non-OC RTX 2070 is very fast with Lc0

Post by crem »

Albert Silver wrote: Mon Dec 24, 2018 4:51 am
No, in the second case there was no GPU usage for sure. Roundrobin is a new multiGPU option I used in v20, and that is remarkably efficient in single-GPU as well. Here was my commandline:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
There's no way roundrobin could help in the single-GPU case. What roundrobin does is alternate GPUs on every iteration. As there's just 1 GPU, it doesn't do anything and just forwards all requests to the same backend.
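
To illustrate the point, here is a toy Python sketch of a round-robin dispatcher. It is not Lc0's actual code (Lc0 is written in C++); the RoundRobin class and make_backend helper are hypothetical stand-ins that only show why alternating over a list of one backend changes nothing.

```python
from itertools import cycle


def make_backend(name):
    """Hypothetical backend: reports which device handled the batch."""
    def run(batch):
        return (name, len(batch))
    return run


class RoundRobin:
    """Forward each request to the next backend in turn."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._next_backend = cycle(backends)

    def compute(self, batch):
        backend = next(self._next_backend)
        return backend(batch)


batch = [0] * 512  # placeholder for a minibatch of positions

single = RoundRobin([make_backend("gpu0")])
single_order = [single.compute(batch)[0] for _ in range(4)]

dual = RoundRobin([make_backend("gpu0"), make_backend("gpu1")])
dual_order = [dual.compute(batch)[0] for _ in range(4)]

print(single_order)
print(dual_order)
```

With one backend every request lands on the same device, so the wrapper adds no parallelism; only with two or more backends does the alternation spread the work.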
Albert Silver
Posts: 3019
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Albert Silver »

crem wrote: Mon Dec 24, 2018 10:51 am
Albert Silver wrote: Mon Dec 24, 2018 4:51 am
No, in the second case there was no GPU usage for sure. Roundrobin is a new multiGPU option I used in v20, and that is remarkably efficient in single-GPU as well. Here was my commandline:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
There's no way roundrobin could help in the single-GPU case. What roundrobin does is alternate GPUs on every iteration. As there's just 1 GPU, it doesn't do anything and just forwards all requests to the same backend.
Ok, thanks for clarifying that. I can only assume the better CPUs are what are boosting the single GPU performance.
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
Javier Ros
Posts: 200
Joined: Fri Oct 12, 2012 12:48 pm
Location: Seville (SPAIN)
Full name: Javier Ros

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Javier Ros »

Albert Silver wrote: Mon Dec 24, 2018 4:51 am Here was my commandline:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
Is your choice of parameters

--cpuct=3.4
--smart-pruning-factor=0.000

better than the default?

--cpuct=3.0
--smart-pruning-factor=1.330000
Albert Silver
Posts: 3019
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Albert Silver »

Javier Ros wrote: Tue Dec 25, 2018 1:43 pm
Albert Silver wrote: Mon Dec 24, 2018 4:51 am Here was my commandline:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
Is your choice of parameters

--cpuct=3.4
--smart-pruning-factor=0.000

better than the default?

--cpuct=3.0
--smart-pruning-factor=1.330000
When testing with a benchmark such as speed or tactics, smart pruning should always be turned off IMHO, as this allows the engine to always use the full time. The cpuct of 3.4 is likely stronger for playing, though: in a lengthy CLOP run I also converged on a cpuctbase of 53500.
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos »

Maybe after all I will fry the GPU or something :D.

My theoretically hardly overclockable GPU is in fact easy to overclock.
I installed two additional case fans (very important), a larger one for the GPU and a smaller one for the CPU, and temperatures on long runs decreased by 12-14 degrees Celsius on both CPU and GPU. The GPU temperatures never went above the low 50s Celsius, so low that I decided to overclock my GPU using MSI Afterburner. From a base core clock of 1620 MHz I went to 1780 MHz, increasing the power limit by 6%. The voltage is fixed for my GPU (that's why it's not very advisable to overclock it by much), but in two days of continuous long runs nothing happened: the maximum GPU temperature was 61 Celsius on very long full-GPU runs (many hours), and everything is very stable. Leela speeds are now almost 10% faster on my RTX 2070. In middlegames in TCEC conditions (and with the same net, T32930) it usually churns out 27 kNPS to 36 kNPS, with occasional spikes of 40-50 kNPS and higher. That's more than 50% of the TCEC middlegame speeds I have seen in games, and they used a 2080Ti + 2080. Either their scaling is bad, or their GPUs are throttling due to temperatures or power limits. From the benchmarks and games I saw, a regular RTX 2080Ti is only some 50% faster than my RTX 2070 now.
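
As a quick sanity check on the "almost 10%" figure, the clock increase alone can be worked out in a couple of lines of Python. This is only a back-of-the-envelope sketch: it assumes Leela's speed scales roughly linearly with core clock, which the numbers above suggest but do not prove.

```python
base_mhz = 1620   # core clock before overclocking, from the post
oc_mhz = 1780     # core clock after overclocking, from the post

# Fractional increase in core clock.
clock_gain = oc_mhz / base_mhz - 1

print(f"core clock increase: {clock_gain:.1%}")
```

The clock went up by just under 10%, so a speed gain of almost 10% is about what one would expect if the GPU is neither throttling nor limited by memory bandwidth.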

I guess that if RTX 2060 overclocks as easily as RTX 2070, maybe it is the best buy, although it's not clear to me that 2x GPUs scale very well (seeing TCEC and CCC).
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: My non-OC RTX 2070 is very fast with Lc0

Post by corres »

Laskos wrote: Tue Dec 04, 2018 7:04 am
Werewolf wrote: Mon Dec 03, 2018 2:48 pm
Laskos wrote: Mon Nov 19, 2018 3:00 pm Just got and installed it. With one of the latest nets, Lc0 v19 rc5 engine:

UCI commands:

setoption name Backend value cudnn-fp16
setoption name MinibatchSize value 512
setoption name NNCacheSize value 2000000
go

info depth 19 seldepth 52 time 41681 nodes 984582 score cp 27 hashfull 274 nps 23621
info depth 21 seldepth 53 time 69999 nodes 2032430 score cp 26 hashfull 431 nps 29035
info depth 22 seldepth 54 time 93937 nodes 2845554 score cp 26 hashfull 570 nps 30292

Didn't quite expect such speeds, would have been happy even with 18,000-20,000.
Some 5-6 fold improvement over GTX 1060.

My power supply is not that strong (500W), hope it stays well.

I'm not doubting your results Laskos, but I'm struggling to understand them.
Your 1060 card produced about 4.4 TFLOPS FP32. Your 2070 card is around 7.5 TFLOPS FP32 which with the new ability to use FP16 means about 15 TFLOPS.

That should make your 2070 just under 4x faster than your 1060. Instead you report 5-6x improvement.

Happy for you...but confused.
Joshua explained, and it is also explained here:
viewtopic.php?f=2&t=68448&start=44

The speed-up is at least 5 compared to GTX 1060 in almost any condition, and larger than 6 with both in "ideal" conditions.
Test net is ID11261
With GTX 1060 6GB, in ideal settings, I was never getting more than 5100 NPS with it from starting position, but with my RTX 2070, just now setting these values

setoption name Backend value cudnn-fp16
setoption name MinibatchSize value 512
setoption name NNCacheSize value 5000000
setoption name WeightsFile value .\weights_11261.txt.gz

I am getting from initial position:

info depth 19 seldepth 55 time 243657 nodes 8144964 score cp 29 hashfull 599 nps 33427

Which is 6.5 times the maximum speed from the initial position that I got for the GTX 1060 6GB with correct settings.
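
The ratio can be checked directly from the numbers in this post, and set against the rough TFLOPS-based estimate Werewolf gave in the quote above. A minimal Python sketch; the TFLOPS values are the approximate ones from that quote, and doubling FP32 throughput for FP16 is his simplifying assumption, not a measured figure.

```python
# Measured speeds from the starting position (nodes per second).
gtx1060_nps = 5100    # best value seen on the GTX 1060 6GB
rtx2070_nps = 33427   # from the info line above

measured_speedup = rtx2070_nps / gtx1060_nps

# Paper estimate from the quoted post: FP32 TFLOPS, doubled for FP16.
gtx1060_tflops = 4.4
rtx2070_tflops_fp16 = 2 * 7.5

estimated_speedup = rtx2070_tflops_fp16 / gtx1060_tflops

print(f"measured:  {measured_speedup:.2f}x")
print(f"estimated: {estimated_speedup:.2f}x")
```

The measured ratio is well above the TFLOPS-based estimate, which is the gap Werewolf was asking about.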

But I rarely go to 4 min/move in gameplay, only in analysis. Anyway, setting the correct parameters, my RTX 2070 (non-OCed) is about 6 times faster than my GTX 1060 6GB with correct parameters in almost all time and net ID conditions (at least with these 20x256 nets).

I also checked for possible throttling: in 12 hours at full load, the temperature is 68C at most, with no throttling at all in GPU-Z and no problem with the power supply (a 500W one, but it doesn't seem to complain).

As I said, I myself didn't expect these speeds from the RTX 2070; I would have been happy if 20,000 NPS were achieved in correct conditions. So, I felt compelled to open a thread here.
Reminder 1 for Laskos
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: My non-OC RTX 2070 is very fast with Lc0

Post by corres »

Laskos wrote: Thu Dec 06, 2018 3:43 pm
Albert Silver wrote: Thu Dec 06, 2018 2:25 pm
Laskos wrote: Thu Dec 06, 2018 3:59 am
brianr wrote: Thu Dec 06, 2018 3:33 am OK, something seems off.

Why are the 2080 depth 17 nodes so many more than the depth 19 with the 2070?

Maybe I am missing something.
Thanks.
Probably different nets used. But the speed should be fairly uniform with the latter test30 nets, so nps are probably fair to compare.
I used 11250, which I had on hand.
Ok, with this net I am getting:

info depth 16 seldepth 43 time 95727 nodes 2810327 score cp 25 hashfull 643 nps 29357,

so yours is about 23% higher, with about 28% more CUDA cores at a 7% higher frequency, for a total expected speed-up of 37%. It seems memory speed and bandwidth also matter, as those are the same in the 2070 and the 2080. Also, the price is 40% higher. I think the least cost-effective would be the RTX 2080 Ti, and the most cost-effective a dual RTX 2070.
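
The 37% figure comes from multiplying the two advantages rather than adding them. A short Python sketch of that arithmetic, using only the percentages quoted above; treating core count and clock as independent multiplicative factors is an assumption of this estimate.

```python
more_cores = 0.28        # 2080 has about 28% more CUDA cores
higher_clock = 0.07      # at about 7% higher frequency
observed_gain = 0.23     # measured nps advantage of the 2080

# Core count and clock multiply, so the gains compound.
expected_gain = (1 + more_cores) * (1 + higher_clock) - 1

# Share of the paper advantage that shows up in nps.
realised = observed_gain / expected_gain

print(f"expected: {expected_gain:.0%}, observed: {observed_gain:.0%}")
print(f"share of expected gain realised: {realised:.0%}")
```

Only around three fifths of the paper advantage is realised, which is consistent with the suggestion that the identical memory speed and bandwidth hold the 2080 back.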
Reminder 2 for Laskos
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos »

corres wrote: Mon Apr 22, 2019 10:06 am
Laskos wrote: Thu Dec 06, 2018 3:43 pm
Albert Silver wrote: Thu Dec 06, 2018 2:25 pm
Laskos wrote: Thu Dec 06, 2018 3:59 am
brianr wrote: Thu Dec 06, 2018 3:33 am OK, something seems off.

Why are the 2080 depth 17 nodes so many more than the depth 19 with the 2070?

Maybe I am missing something.
Thanks.
Probably different nets used. But the speed should be fairly uniform with the latter test30 nets, so nps are probably fair to compare.
I used 11250, which I had on hand.
Ok, with this net I am getting:

info depth 16 seldepth 43 time 95727 nodes 2810327 score cp 25 hashfull 643 nps 29357,

so yours is about 23% higher, with about 28% more CUDA cores at a 7% higher frequency, for a total expected speed-up of 37%. It seems memory speed and bandwidth also matter, as those are the same in the 2070 and the 2080. Also, the price is 40% higher. I think the least cost-effective would be the RTX 2080 Ti, and the most cost-effective a dual RTX 2070.
Reminder 2 for Laskos
Yes, specialist, using v19 or v20 engine and old 10xxx net format. Also, I am not sure what cache I have used. At least be a bit consistent if putting forward some performance lists. I never posted those, as people use all the different conditions. Larry Kaufman in a matter if 15 minutes reproduced my conditions and we derived that his non-OC 2080 is 21% faster than my non-OC 2070. You seem to need weeks of debates.