My non-OC RTX 2070 is very fast with Lc0

Laskos
Posts: 9441
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos » Mon Dec 24, 2018 5:06 am

Albert Silver wrote:
Mon Dec 24, 2018 3:51 am

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
Strange.

With that:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.txt.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000

I get:
info depth 15 seldepth 47 time 486903 nodes 17279553 score cp 24 hashfull 682 nps 35488


With that:

lc0-v20rc2.exe --cpuct=3.4 --backend=cudnn-fp16 --minibatch-size=512 --weights=11250.txt.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000

I get an almost identical:
info depth 16 seldepth 48 time 573761 nodes 20451954 score cp 24 hashfull 798 nps 35645


Both are pretty high for this net and my RTX 2070, but "roundrobin" seems to have no effect.
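
For anyone who wants to check such comparisons, here is a small Python sketch. The helper names parse_info and recomputed_nps are made up for illustration and are not part of Lc0 or any UCI library. It pulls the time, nodes and nps fields out of the two info lines above, assuming time is reported in milliseconds as usual in UCI output, and compares the two runs.

```python
def parse_info(line):
    """Extract the time, nodes and nps fields from a UCI 'info' line."""
    tokens = line.split()
    fields = {}
    for key in ("time", "nodes", "nps"):
        # Each of these keys is followed directly by its integer value.
        fields[key] = int(tokens[tokens.index(key) + 1])
    return fields


def recomputed_nps(fields):
    """Nodes per second recomputed from nodes and elapsed milliseconds."""
    return fields["nodes"] * 1000 / fields["time"]


# The two info lines quoted in the post above.
roundrobin = parse_info(
    "info depth 15 seldepth 47 time 486903 nodes 17279553 "
    "score cp 24 hashfull 682 nps 35488"
)
plain = parse_info(
    "info depth 16 seldepth 48 time 573761 nodes 20451954 "
    "score cp 24 hashfull 798 nps 35645"
)

# Relative difference between the reported speeds of the two runs.
relative_diff = (plain["nps"] - roundrobin["nps"]) / roundrobin["nps"]

print(f"roundrobin: reported {roundrobin['nps']}, recomputed {recomputed_nps(roundrobin):.0f}")
print(f"plain:      reported {plain['nps']}, recomputed {recomputed_nps(plain):.0f}")
print(f"difference: {relative_diff:.2%}")
```

The recomputed values agree with the reported nps to within one node per second, and the gap between the two backends is below half a percent, which fits "almost identical".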

Laskos
Posts: 9441
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos » Mon Dec 24, 2018 5:24 am

Laskos wrote:
Mon Dec 24, 2018 5:06 am
Albert Silver wrote:
Mon Dec 24, 2018 3:51 am

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
Strange.

With that:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.txt.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000

I get:
info depth 15 seldepth 47 time 486903 nodes 17279553 score cp 24 hashfull 682 nps 35488


With that:

lc0-v20rc2.exe --cpuct=3.4 --backend=cudnn-fp16 --minibatch-size=512 --weights=11250.txt.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000

I get an almost identical:
info depth 16 seldepth 48 time 573761 nodes 20451954 score cp 24 hashfull 798 nps 35645


Both are pretty high for this net and my RTX 2070, but "roundrobin" seems to have no effect.
Test30 nets are much faster:

lc0-v20rc2.exe --cpuct=3.4 --backend=cudnn-fp16 --minibatch-size=512 --weights=weights_run2_32112.pb.gz --nncache=10000000 --threads=3 --smart-pruning-factor=0.000


info depth 18 seldepth 49 time 297643 nodes 14395400 score cp 46 hashfull 379 nps 48364

crem
Posts: 123
Joined: Wed May 23, 2018 7:29 pm

Re: My non-OC RTX 2070 is very fast with Lc0

Post by crem » Mon Dec 24, 2018 9:51 am

Albert Silver wrote:
Mon Dec 24, 2018 3:51 am

No, in the second case there was no GPU usage for sure. Roundrobin is a new multiGPU option I used in v20, and that is remarkably efficient in single-GPU as well. Here was my commandline:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
There's no way roundrobin could help in the single-GPU case. What roundrobin does is alternate GPUs on every iteration. As there's just one GPU, it doesn't do anything and just forwards all requests to the same backend.
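
To illustrate that point, here is a toy Python sketch of round-robin dispatch. It is not Lc0's actual code: the FakeBackend and RoundRobin classes are hypothetical stand-ins, and the only thing they model is which backend receives each batch.

```python
from itertools import cycle


class FakeBackend:
    """Hypothetical stand-in for one GPU backend; it only counts batches."""

    def __init__(self, name):
        self.name = name
        self.batches = 0

    def compute(self, batch):
        self.batches += 1
        return f"{self.name} evaluated {len(batch)} positions"


class RoundRobin:
    """Hands each incoming batch to the next backend in turn."""

    def __init__(self, backends):
        self._next_backend = cycle(backends)

    def compute(self, batch):
        return next(self._next_backend).compute(batch)


# One GPU: every batch ends up on the same backend, so the wrapper
# is a plain pass-through.
single = FakeBackend("gpu0")
rr_single = RoundRobin([single])
for _ in range(4):
    rr_single.compute(["position"] * 512)

# Two GPUs: the batches alternate between the backends.
gpu0, gpu1 = FakeBackend("gpu0"), FakeBackend("gpu1")
rr_dual = RoundRobin([gpu0, gpu1])
for _ in range(4):
    rr_dual.compute(["position"] * 512)

print(single.batches, gpu0.batches, gpu1.batches)  # prints: 4 2 2
```

With a single backend in the list, the rotation has nothing to rotate over, which is the behaviour described above.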

Albert Silver
Posts: 2860
Joined: Wed Mar 08, 2006 8:57 pm
Location: Rio de Janeiro, Brazil

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Albert Silver » Mon Dec 24, 2018 7:44 pm

crem wrote:
Mon Dec 24, 2018 9:51 am
Albert Silver wrote:
Mon Dec 24, 2018 3:51 am

No, in the second case there was no GPU usage for sure. Roundrobin is a new multiGPU option I used in v20, and that is remarkably efficient in single-GPU as well. Here was my commandline:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
There's no way roundrobin could help in the single-GPU case. What roundrobin does is alternate GPUs on every iteration. As there's just one GPU, it doesn't do anything and just forwards all requests to the same backend.
Ok, thanks for clarifying that. I can only assume the better CPUs are what are boosting the single GPU performance.
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."

Javier Ros
Posts: 181
Joined: Fri Oct 12, 2012 10:48 am
Location: Seville (SPAIN)
Full name: Javier Ros

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Javier Ros » Tue Dec 25, 2018 12:43 pm

Albert Silver wrote:
Mon Dec 24, 2018 3:51 am
Here was my commandline:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
Is your choice of parameters

--cpuct=3.4
--smart-pruning-factor=0.000

better than the defaults?

--cpuct=3.0
--smart-pruning-factor=1.330000
The love relationship between a chess engine tester and his computer can be summarized in one sentence:
Until heat do us part.

Albert Silver
Posts: 2860
Joined: Wed Mar 08, 2006 8:57 pm
Location: Rio de Janeiro, Brazil

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Albert Silver » Tue Dec 25, 2018 3:00 pm

Javier Ros wrote:
Tue Dec 25, 2018 12:43 pm
Albert Silver wrote:
Mon Dec 24, 2018 3:51 am
Here was my commandline:

lc0-v20rc2.exe --cpuct=3.4 --backend=roundrobin --backend-opts="(backend=cudnn-fp16,gpu=0)" --minibatch-size=512 --weights=11250.pb --nncache=5000000 --threads=3 --smart-pruning-factor=0.000
Is your choice of parameters

--cpuct=3.4
--smart-pruning-factor=0.000

better than the defaults?

--cpuct=3.0
--smart-pruning-factor=1.330000
When testing with a benchmark such as speed or tactics, smart pruning should always be turned off IMHO, as this allows the engine to always use the full time. The cpuct of 3.4 is likely stronger for playing, though. In a lengthy CLOP run, I also converged on a cpuctbase of 53500.
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."

Laskos
Posts: 9441
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos » Sun Mar 10, 2019 4:52 pm

Maybe after all I will fry the GPU or something :D.

My theoretically hardly overclockable GPU is in fact easy to overclock.
I installed two additional case fans (very important), a larger one for the GPU and a smaller one for the CPU, and temperatures on long runs decreased by 12-14 degrees Celsius on both CPU and GPU. The GPU temperature never went above the low 50s Celsius, so low that I decided to overclock my GPU using MSI Afterburner. From a base core clock of 1620 MHz I went to a base of 1780 MHz, increasing the power limit by 6%. The voltage is fixed for my GPU (that's why it's not very advisable to overclock it by much), but in two days of continuous long runs nothing happened: the maximum GPU temperature was 61 Celsius on very long full-GPU runs (many hours), and everything is very stable. Leela speeds are now almost 10% faster on my RTX 2070. In middlegames in TCEC conditions (and with the same net, T32930) it usually churns out 27 kNPS to 36 kNPS, with occasional spikes of 40-50 kNPS and higher. That's more than 50% of the TCEC middlegame speeds I have seen in games, and they used 2080Ti + 2080. Either their scaling is bad, or their GPUs are throttling due to temperatures or power limits. From benchmarks and games I saw, a regular RTX 2080Ti is only some 50% faster than my RTX 2070 now.

I guess that if RTX 2060 overclocks as easily as RTX 2070, maybe it is the best buy, although it's not clear to me that 2x GPUs scale very well (seeing TCEC and CCC).
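
As a quick sanity check on the "almost 10% faster" figure, the clock numbers from this post can be put through a couple of lines of Python. This is only arithmetic on the two core clocks quoted above; whether Lc0 speed really scales linearly with core clock is an assumption, not something measured here.

```python
base_mhz = 1620  # stock base core clock quoted in the post
oc_mhz = 1780    # overclocked base core clock quoted in the post

# Relative increase in core clock.
clock_gain = oc_mhz / base_mhz - 1

print(f"core clock increase: {clock_gain:.1%}")  # prints: core clock increase: 9.9%
```

So the reported speed-up is roughly in line with the clock increase itself.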

corres
Posts: 1641
Joined: Wed Nov 18, 2015 10:41 am
Location: hungary

Re: My non-OC RTX 2070 is very fast with Lc0

Post by corres » Mon Apr 22, 2019 8:04 am

Laskos wrote:
Tue Dec 04, 2018 6:04 am
Werewolf wrote:
Mon Dec 03, 2018 1:48 pm
Laskos wrote:
Mon Nov 19, 2018 2:00 pm
Just got and installed it. With one of the latest nets, Lc0 v19 rc5 engine:

UCI commands:

setoption name Backend value cudnn-fp16
setoption name MinibatchSize value 512
setoption name NNCacheSize value 2000000
go

info depth 19 seldepth 52 time 41681 nodes 984582 score cp 27 hashfull 274 nps 23621
info depth 21 seldepth 53 time 69999 nodes 2032430 score cp 26 hashfull 431 nps 29035
info depth 22 seldepth 54 time 93937 nodes 2845554 score cp 26 hashfull 570 nps 30292

Didn't quite expect such speeds, would have been happy even with 18,000-20,000.
Some 5-6 fold improvement over GTX 1060.

My power supply is not that strong (500W), hope it stays well.

I'm not doubting your results Laskos, but I'm struggling to understand them.
Your 1060 card produced about 4.4 TFLOPS FP32. Your 2070 card is around 7.5 TFLOPS FP32 which with the new ability to use FP16 means about 15 TFLOPS.

That should make your 2070 just under 4x faster than your 1060. Instead you report 5-6x improvement.

Happy for you...but confused.
Joshua explained, and it is also explained here:
viewtopic.php?f=2&t=68448&start=44

The speed-up is at least 5 compared to GTX 1060 in almost any condition, and larger than 6 with both in "ideal" conditions.
Test net is ID11261
With GTX 1060 6GB, in ideal settings, I was never getting more than 5100 NPS with it from starting position, but with my RTX 2070, just now setting these values

setoption name Backend value cudnn-fp16
setoption name MinibatchSize value 512
setoption name NNCacheSize value 5000000
setoption name WeightsFile value .\weights_11261.txt.gz

I am getting from initial position:

info depth 19 seldepth 55 time 243657 nodes 8144964 score cp 29 hashfull 599 nps 33427

Which is 6.5 times the maximum speed from the initial position that I got for the GTX 1060 6GB with correct settings.

But I rarely go to 4 min/move in gameplay, only in analysis. Anyway, setting the correct parameters, my RTX 2070 (non-OCed) is about 6 times faster than my GTX 1060 6GB with correct parameters in almost all time and net ID conditions (at least with these 20x256 nets).

I also checked for possible throttling: in 12 hours at full load, the temperature is at most 68C, with no throttling reported in GPU-Z and no problem with the power supply (a 500W one, but it doesn't seem to complain).

As I said, I myself didn't expect these speeds from the RTX 2070; I would have been happy if 20,000 NPS were achieved in correct conditions. So I felt compelled to open a thread here.
Reminder 1 for Laskos
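
For reference, the two ratios debated in the quoted exchange can be reproduced from the numbers given there. This is a minimal Python sketch using only the quoted figures (4.4 and 7.5 TFLOPS FP32, the doubling for FP16, 5100 and 33427 NPS); the variable names are illustrative.

```python
gtx1060_fp32_tflops = 4.4   # figure quoted by Werewolf
rtx2070_fp32_tflops = 7.5   # figure quoted by Werewolf
rtx2070_fp16_tflops = 2 * rtx2070_fp32_tflops  # "about 15 TFLOPS"

# Speed-up expected from raw TFLOPS alone.
theoretical = rtx2070_fp16_tflops / gtx1060_fp32_tflops

# Speed-up from the quoted start-position measurements. The 5100 NPS
# figure is an upper bound for the GTX 1060, so this ratio is a lower bound.
measured = 33427 / 5100

print(f"theoretical: {theoretical:.2f}x, measured: {measured:.2f}x")
# prints: theoretical: 3.41x, measured: 6.55x
```

The gap between the two numbers is exactly what the quoted posts are arguing about.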

corres
Posts: 1641
Joined: Wed Nov 18, 2015 10:41 am
Location: hungary

Re: My non-OC RTX 2070 is very fast with Lc0

Post by corres » Mon Apr 22, 2019 8:06 am

Laskos wrote:
Thu Dec 06, 2018 2:43 pm
Albert Silver wrote:
Thu Dec 06, 2018 1:25 pm
Laskos wrote:
Thu Dec 06, 2018 2:59 am
brianr wrote:
Thu Dec 06, 2018 2:33 am
OK, something seems off.

Why are the 2080 depth 17 nodes so many more than the depth 19 with the 2070?

Maybe I am missing something.
Thanks.
Probably different nets used. But the speed should be fairly uniform with the latter test30 nets, so nps are probably fair to compare.
I used 11250, which I had on hand.
Ok, with this net I am getting:

info depth 16 seldepth 43 time 95727 nodes 2810327 score cp 25 hashfull 643 nps 29357,

so yours is about 23% higher, having about 28% more CUDA cores at 7% higher frequency. In total, a 37% expected speed-up. It seems memory speed and bandwidth also matter, as those are the same in the 2070 and 2080. Also, the price is 40% higher. I think the least cost-effective would be the RTX 2080 Ti, and the most cost-effective a dual RTX 2070.
Reminder 2 for Laskos
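
The percentages in the quoted post combine as follows. This is a small Python sketch using only the quoted figures; treating cores and frequency as multiplying together is the same simple model the quoted post appears to use, not a claim about how Lc0 really scales.

```python
core_ratio = 1.28    # "about 28% more CUDA cores"
clock_ratio = 1.07   # "7% higher frequency"
observed = 1.23      # "about 23% higher" nps
price_ratio = 1.40   # "the price is 40% higher"

# Expected speed-up if nps scaled with cores times frequency.
expected = core_ratio * clock_ratio

# Speed per unit of price for the 2080, relative to the 2070.
speed_per_price = observed / price_ratio

print(f"expected: {expected:.2f}x, observed: {observed:.2f}x")
print(f"2080 speed per price relative to 2070: {speed_per_price:.2f}")
```

The expected figure comes out at about 1.37, matching the 37% in the post, while the relative speed per price comes out below 1.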

Laskos
Posts: 9441
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: My non-OC RTX 2070 is very fast with Lc0

Post by Laskos » Mon Apr 22, 2019 8:16 am

corres wrote:
Mon Apr 22, 2019 8:06 am
Laskos wrote:
Thu Dec 06, 2018 2:43 pm
Albert Silver wrote:
Thu Dec 06, 2018 1:25 pm
Laskos wrote:
Thu Dec 06, 2018 2:59 am
brianr wrote:
Thu Dec 06, 2018 2:33 am
OK, something seems off.

Why are the 2080 depth 17 nodes so many more than the depth 19 with the 2070?

Maybe I am missing something.
Thanks.
Probably different nets used. But the speed should be fairly uniform with the latter test30 nets, so nps are probably fair to compare.
I used 11250, which I had on hand.
Ok, with this net I am getting:

info depth 16 seldepth 43 time 95727 nodes 2810327 score cp 25 hashfull 643 nps 29357,

so yours is about 23% higher, having about 28% more CUDA cores at 7% higher frequency. In total, a 37% expected speed-up. It seems memory speed and bandwidth also matter, as those are the same in the 2070 and 2080. Also, the price is 40% higher. I think the least cost-effective would be the RTX 2080 Ti, and the most cost-effective a dual RTX 2070.
Reminder 2 for Laskos
Yes, specialist, using a v19 or v20 engine and the old 10xxx net format. Also, I am not sure what cache I used. At least be a bit consistent if putting forward performance lists. I never posted those, as people use all sorts of different conditions. Larry Kaufman, in a matter of 15 minutes, reproduced my conditions and we derived that his non-OC 2080 is 21% faster than my non-OC 2070. You seem to need weeks of debates.
