Dual RTX 2060 for Leela

corres · Post by **corres** » Mon Apr 22, 2019 10:39 am

I made some new tests with Net 11250, Net 32930, Net 42033 based on the parameters of Albert Silver.
System: Ryzen7 1800x 8x4000 MHz, 2xRTX 2060 OC
Common parameters:
threads for GPU=2 threads/GPU
backend=cudnn-fp16 (multiplexing)
minibatchsize=512
nncachesize=5000000
cpuct=3.4
smartpruningfactor=0.0
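
For anyone wanting to reproduce the single-GPU runs, the parameter list above can be written as UCI `setoption` commands of the kind Laskos posts later in this thread. This is only a minimal sketch: the helper name `to_uci` is hypothetical, and the option spellings `CPuct` and `SmartPruningFactor` are assumptions that should be checked against the option list your lc0 build prints in response to `uci`. The multiplexing setup for two GPUs is not covered here.

```python
def to_uci(options):
    """Turn a {name: value} mapping into UCI 'setoption' command lines."""
    return [f"setoption name {name} value {value}" for name, value in options.items()]

# Single-GPU settings from the post. Option spellings other than Backend,
# MinibatchSize and NNCacheSize are assumptions, not confirmed in the thread.
common = {
    "Backend": "cudnn-fp16",
    "MinibatchSize": 512,
    "NNCacheSize": 5000000,
    "CPuct": 3.4,
    "SmartPruningFactor": 0.0,
}

for line in to_uci(common):
    print(line)
```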

Test 1 - Net 11250
GPU1
max nps = 30800 (depth 13 nodes 88844854)
GPU2
max nps = 27500 (depth 13 nodes 9891986)
DUAL GPU
max nps = 55960 (depth 13 nodes 8446664)
Effectiveness 96 %

Test 2 - Net 32930
GPU1
max nps = 25690 (depth 25 nodes 6837540)
GPU2
max nps = 22620 (depth 24 nodes 5968236)
DUAL GPU
max nps = 45550 (depth 24 nodes 5757501)
Effectiveness 94 %

Test 3 - Net 42033
GPU1
max nps = 34670 (depth 19 nodes 14609585 )
GPU2
max nps = 32440 (depth 19 nodes 13737343)
DUAL GPU
max nps = 52350 (depth 19 nodes 13209499)
Effectiveness 78 % (!)

As we can see, with Net 42033 the effectiveness of the dual-GPU setup dropped drastically.
Maybe this is caused by the weaker hardware of the RTX 2060.
It would be good to know how the RTX 2080 Ti + RTX 2080 duo performs.
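
The "Effectiveness" figures in this post appear to be the dual-GPU nps divided by the sum of the two single-GPU results. That definition is inferred from the numbers, not stated in the post. A short sketch that recomputes it from the posted values (the function name `dual_efficiency` is made up for illustration):

```python
def dual_efficiency(gpu1_nps, gpu2_nps, dual_nps):
    """Dual-GPU nps as a fraction of the sum of the two single-GPU results."""
    return dual_nps / (gpu1_nps + gpu2_nps)

# (GPU1, GPU2, DUAL) max nps as posted above
results = {
    "Net 11250": (30800, 27500, 55960),
    "Net 32930": (25690, 22620, 45550),
    "Net 42033": (34670, 32440, 52350),
}

for net, (gpu1, gpu2, dual) in results.items():
    print(f"{net}: {dual_efficiency(gpu1, gpu2, dual):.0%}")
```

With the posted numbers this reproduces 96 %, 94 % and 78 %.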

corres · Post by **corres** » Mon Apr 22, 2019 10:53 am

Laskos wrote: ↑Mon Apr 22, 2019 10:12 am Also, forgot to mention, those results in your list are for different engines, ...

This is the only point on which you are right.
But the position of the GPUs in the list is independent of the Leela version.
Only the concrete nps values may change.
All the parameters can be found in the "My non-OC RTX 2070... " posts.
Please read Reminder 1, Reminder 2, etc.
(Na ja, the pills...)

A note
I do not like blah-blah in technical writing.

Laskos · Post by **Laskos** » Mon Apr 22, 2019 11:03 am

corres wrote: ↑Mon Apr 22, 2019 10:53 am
Laskos wrote: ↑Mon Apr 22, 2019 10:12 am Also, forgot to mention, those results in your list are for different engines, ...
This is the only point on which you are right.
But the position of the GPUs in the list is independent of the Leela version.
Only the concrete nps values may change.
All the parameters can be found in the "My non-OC RTX 2070... " posts.
Please read Reminder 1, Reminder 2, etc.
(Na ja, the pills...)

A note
I do not like blah-blah in technical writing.

Ok, hopeless. Just post your lists for other confused people to be even more confused.

jp · Post by jp » Mon Apr 22, 2019 1:12 pm

chrisw wrote: ↑Sat Apr 20, 2019 6:05 pm
smatovic wrote: ↑Sat Apr 20, 2019 11:04 am
chrisw wrote: ↑Sat Apr 20, 2019 1:32 am What is a node in lc0? This is not a stupid question, btw. Could be every node in the tree. Could be every node in the tree plus every node outside the tree. If it’s a tree node, does it count 1 for every time the search goes through, or just the once when it is visited the first time? Could be the count of discrete NN lookups. I kind of have been assuming it’s a “normal” computer chess node, ticked up for every “move”, but maybe not?
I asked Ankan once, he told me that for nps LC0 counts the NN eval calls made by search,
this includes Policy/Value and NN cache hits.
So in-tree nodes are not counted twice. Is normal in AB to count those, every step in the tree counts. So, in reality, compared to AB nodecounter, LC0 is undercounting by, let me work it out ... a factor of average branch length minus one. What’s average branch length? More as more search, maybe five to ten? Guessing.

So Lc0 is counting only leaf nodes per second?

Do they do that because they think A0 did that too? Did A0 do that too?
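
To make the distinction in the quoted exchange concrete: going by the description attributed to Ankan, each playout ends in one NN eval (or cache hit), and that is what gets counted, whereas an alpha-beta style counter ticks once for every move made along the way. The sketch below only illustrates the two counting conventions. The path lengths are made-up data and nothing here is taken from the lc0 source.

```python
def count_nodes(path_lengths):
    """Compare two ways of counting 'nodes' for a list of playouts.

    path_lengths: number of moves walked from the root in each playout
    (hypothetical data; a real search would record this itself).
    """
    eval_count = len(path_lengths)   # one NN eval or cache hit per playout
    step_count = sum(path_lengths)   # one tick per move made, as in alpha-beta
    return eval_count, step_count

evals, steps = count_nodes([3, 5, 8, 6, 7])
print(evals, steps, steps / evals)
```

The ratio of the two counts is the average path length of the playouts, which is the kind of factor chrisw is guessing at.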

corres · Post by **corres** » Mon Apr 22, 2019 5:57 pm

I tested some distilled/boosted (dkappe) networks.
The common parameters are:
System: Ryzen7 1800x 8x4000 MHz
LC0 version: 0.21.1
Backend: cudnn-fp16 (multiplexing)
Minibatchsize=512
NNcachesize=5000000
Cpuct=3.4
Smartpruningfactor=0.0
Other parameters are default

Test 1 - Net 11258-120x10se
GPU1 max nps = 26480 (depth 9 nodes 4180266)
GPU2 max nps = 17596 (depth 9 nodes 4812847)
DUAL max nps = 30689 (depth 9 nodes 2250770)
This net behaves very irregularly, maybe its structure is distorted and/or Leela is not optimized for it.

Test 2 - Net 32930-boost-7000
GPU1 max nps = 32092 (depth 15 nodes 14909382)
GPU2 max nps = 28975 (depth 15 nodes 12869667)
DUAL max nps = 54918 (depth 15 nodes 12546035)
Effectiveness ~90 %.
Compared to the default Net 32930, this "boost" net is about 20 % faster.

Test 3 - Net 41800-boost-7000
GPU1 max nps = 28290 (depth 16 nodes 11972071)
GPU2 max nps = 25809 (depth 16 nodes 14590701)
DUAL max nps = 50523 (depth 16 nodes 12991517)
Effectiveness ~93 %
Compared to the default Net 42033, this "boost" net is about 20 % slower.
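
The faster/slower comparisons can be checked against the default-net results posted earlier in the thread. A rough sketch follows, assuming the two sets of runs are comparable (the earlier post does not state the LC0 version). The helper name `percent_change` is made up for illustration.

```python
def percent_change(new_nps, old_nps):
    """Relative nps change of new vs old, in percent (positive = faster)."""
    return (new_nps / old_nps - 1.0) * 100.0

# (GPU1, GPU2, DUAL) max nps as posted in this thread
net_32930 = (25690, 22620, 45550)
boost_32930 = (32092, 28975, 54918)
net_42033 = (34670, 32440, 52350)
boost_41800 = (28290, 25809, 50523)

for label, new, old in [("32930-boost vs 32930", boost_32930, net_32930),
                        ("41800-boost vs 42033", boost_41800, net_42033)]:
    changes = [percent_change(n, o) for n, o in zip(new, old)]
    print(label, [f"{c:+.1f}%" for c in changes])
```

Printing the single-GPU and dual figures side by side shows which of them the "about 20 %" estimates correspond to.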

Albert Silver · Post by **Albert Silver** » Mon Apr 22, 2019 9:36 pm

Laskos wrote: ↑Mon Apr 22, 2019 10:12 am
Laskos wrote: ↑Sun Apr 21, 2019 10:45 pm
corres wrote: ↑Sun Apr 21, 2019 7:15 pm
corres wrote: ↑Sat Apr 20, 2019 1:06 am
Basing on the common test data before we can make a list of RTX 2000 line GPUs.
The common parameters are:
NET: 11250
Backend: cudnn-fp16
Minibatchsize: 512
NNcachesize: 2000000
Other parameters are default
The list:
RTX 2060 OC max nps = 28646 (corres)
RTX 2070 non-OC max nps = 29357 (Laskos)
RTX 2080 max nps = 36300 (Albert Silver)
RTX 2080 Ti max nps = 43297 (Albert Silver)
DUAL RTX 2060 OC max nps = 53789 (corres)
RTX 2080 Ti + RTX 2080 max nps = 77435 (Albert Silver)
You didn't specify the time or nodes at which those were measured, and I don't remember using NN cache of 2000000. So, it's probably another useless list for Leela, as are most circulating around. Again, could you specify all the parameters, as I posted since the start of this thread:

setoption name Backend value cudnn-fp16
setoption name MinibatchSize value 512
setoption name NNCacheSize value 10000000
setoption name WeightsFile value .\11250.txt.gz
go

My now OC-ed 2070 GPU with 2 threads of 3.8GHz i7 CPU gives:

info depth 14 seldepth 43 time 283565 nodes 10050124 score cp 23 hashfull 426 nps 35442 tbhits 0 pv d2d4
after 10 million nodes

info depth 16 seldepth 50 time 409072 nodes 15183130 score cp 25 hashfull 606 nps 37116 tbhits 0 pv d2d4
after 15 million nodes

LTC like 5 minutes is probably better than very short runs, and hashfull is better to be about half. Even longer TC (say 15 minutes) would be good for checking the throttling. TCEC14 Leela machine (an i5) seemed to me to suffer, I guess many have problems over long runs.
Also, forgot to mention, those results in your list are for different engines, from v19.0 to v21.1, and depending on net formats, it can make a significant difference. You are using a demoted net format of 10xxx run, while the reference for long time from now is and will be 40xxx format. The latest nets of the 40xxx are by a significant margin the best ones, even before the last LR drop. My inference is that test 40 is probably already above AlphaZero level, maybe I will open a thread about that.
So, I think the benchmarks now should use v21.1 engine and a test 40 net. I proposed one at the beginning of this thread, but you stuck to your weird arguments and now you post useless performance lists.

Just a sidenote: the NNcache can lead to somewhat misleading nodes per second counts. If you benchmark with a large value, and it does not fill it before the end, you will get significant NPS results, but they will not represent your actual speed throughout the game. Once filled, the NPS will drop a lot, and that is actually the speed you will get for the rest of a game.
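
One way to see whether a benchmark figure was taken before the cache filled is to read `hashfull` alongside `nps` in the engine's `info` output. The sketch below parses the two lines quoted above. It assumes `time` is in milliseconds and `hashfull` is per mille, as in the UCI convention; the helper name `parse_info` is made up for illustration.

```python
def parse_info(line):
    """Extract the integer fields of a UCI 'info' line into a dict."""
    tokens = line.split()
    fields = {}
    for key in ("depth", "seldepth", "time", "nodes", "hashfull", "nps"):
        if key in tokens:
            fields[key] = int(tokens[tokens.index(key) + 1])
    return fields

lines = [
    "info depth 14 seldepth 43 time 283565 nodes 10050124 score cp 23 hashfull 426 nps 35442 tbhits 0 pv d2d4",
    "info depth 16 seldepth 50 time 409072 nodes 15183130 score cp 25 hashfull 606 nps 37116 tbhits 0 pv d2d4",
]

for line in lines:
    info = parse_info(line)
    average_nps = info["nodes"] * 1000 // info["time"]  # time is in milliseconds
    cache_full = info["hashfull"] >= 1000               # hashfull is per mille
    print(info["nodes"], info["nps"], average_nps, cache_full)
```

For both quoted lines the reported nps equals nodes divided by elapsed time, so it is an average over the whole run, and the cache was not yet full at either point.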

corres · Post by **corres** » Tue Apr 23, 2019 9:05 am

Albert Silver wrote: ↑Mon Apr 22, 2019 9:36 pm Just a sidenote: the NNcache can lead to somewhat misleading nodes per second counts. If you benchmark with a large value, and it does not fill it before the end, you will get significant NPS results, but they will not represent your actual speed throughout the game. Once filled, the NPS will drop a lot, and that is actually the speed you will get for the rest of a game.

During the tests there were some cases when hashfull reached 1000. In these cases the nps value stopped growing, but it did not drop.
I tried NNCacheSize=20000000 but lc0 got fried.
I do not know why the developers of LC0 set such a low default value for nncachesize.
Do you have any information about the units of nncachesize and minibatchsize?
And where are they kept: in RAM or in VRAM?
The developers of LC0 hold back some information from us, which is a pity.

Laskos · Post by **Laskos** » Tue Apr 23, 2019 10:33 am

Albert Silver wrote: ↑Mon Apr 22, 2019 9:36 pm
Just a sidenote: the NNcache can lead to somewhat misleading nodes per second counts. If you benchmark with a large value, and it does not fill it before the end, you will get significant NPS results, but they will not represent your actual speed throughout the game. Once filled, the NPS will drop a lot, and that is actually the speed you will get for the rest of a game.

Yes, I am aware of that; this is just a benchmark. I observed that during a game the _average_ speed is about 15% lower than the benchmark figure at the same time used (LTC for this benchmark), when setting a similarly large in-game NNcache. There are some highs and lows in in-game speed that I don't understand; I am talking about the average (better to say "median") speed.
Using a very tiny NNcache, OTOH, can cut the speed by a factor of 2 both in the benchmark and in-game, and is to be avoided.

corres · Post by **corres** » Wed Apr 24, 2019 3:47 pm

I ran some tests to investigate how the nps value depends on the GPU core frequency.
GPU: Gigabyte RTX 2060 Windforce OC (single GPU)
Nominal (OC) GPU core frequency: 1770 MHz
NVIDIA driver version number: 425.31
System: Ryzen7 1800x 8x4000 MHz, Windows 10 64 bits (2019 Oct)
GPU tweaker: ASUS GPU Tweak 2
LC0 parameters:
NET-file=11250
Backend=cudnn-fp16
NNCacheSize=10000000
SmartPruningFactor=0.0
Other parameters are default

Test 1
More OC (+100 MHz)
Nominal GPU frequency: 1870 MHz, Max power 117 %
max nps = 31285 (depth 13 nodes 6790834)
Working temperature was < 80 Celsius

Test 2
(Default OC) GPU frequency: 1770 MHz
max nps = 29225 (depth 13 nodes 6471301)
Working temperature was < 70 Celsius

Test 3
Lowered (-100 MHz) GPU frequency: 1670 MHz
max nps = 27693 (depth 13 nodes 6379993)

Test 4
Lowered (-200 MHz) GPU frequency: 1570 MHz
max nps = 26694 (depth 13 nodes 6886672)

That is:
1. The dependency is non-linear.
2. A 300 MHz (~17 %) change in GPU core frequency changes nps by only about 4600 (~16 %).

Note
1. The MSI GPU tweaker did not work; the Gigabyte GPU tweaker worked but was very unstable; only the ASUS tool worked well.
2. The real GPU core frequencies were about 50 MHz higher than the nominal values, maybe because of throttling.
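
The scaling can be recomputed directly from the four posted results. The sketch below lists the nps gained at each 100 MHz step and the total change relative to the default OC setting. Using the 1770 MHz run as the reference point is an assumption made here, not something stated in the post.

```python
# (nominal core MHz, max nps) from Tests 4, 3, 2, 1, lowest clock first
runs = [(1570, 26694), (1670, 27693), (1770, 29225), (1870, 31285)]

def step_gains(points):
    """nps gained at each successive clock step."""
    return [b[1] - a[1] for a, b in zip(points, points[1:])]

gains = step_gains(runs)
total_nps = runs[-1][1] - runs[0][1]
total_mhz = runs[-1][0] - runs[0][0]
base_mhz, base_nps = runs[2]  # default OC setting as the reference point

print(gains)
print(f"{total_mhz} MHz = {total_mhz / base_mhz:.1%} of {base_mhz} MHz")
print(f"{total_nps} nps = {total_nps / base_nps:.1%} of {base_nps} nps")
```

The per-step gains grow with the clock (999, 1532, 2060 nps), which is one way to read "non-linear" here. These are single runs at nominal frequencies, so the differences between steps may be partly measurement noise.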

chrisw · Post by **chrisw** » Wed Apr 24, 2019 5:32 pm

corres wrote: ↑Tue Apr 23, 2019 9:05 am
Albert Silver wrote: ↑Mon Apr 22, 2019 9:36 pm Just a sidenote: the NNcache can lead to somewhat misleading nodes per second counts. If you benchmark with a large value, and it does not fill it before the end, you will get significant NPS results, but they will not represent your actual speed throughout the game. Once filled, the NPS will drop a lot, and that is actually the speed you will get for the rest of a game.
During the tests there were some cases when hashfull reached 1000. In these cases the nps value stopped growing, but it did not drop.
I tried NNCacheSize=20000000 but lc0 got fried.
I do not know why the developers of LC0 set such a low default value for nncachesize.
Do you have any information about the units of nncachesize and minibatchsize?
And where are they kept: in RAM or in VRAM?
The developers of LC0 hold back some information from us, which is a pity.

I would assume what is happening (I didn’t implement hash yet), is that if a new node matches with a hash table entry, the new node gets given the win and visit count from the hash, and backs that (win,visits) entry back to the root. That root move gets a higher visit count as a result even though it didn’t actually make the visits.
But, the line that put the original data (wins, visits) into the hash was also backed up at the time, and its (wins,visits) will also be represented at the root by its root move.
Then LC0 computes node-count by adding up the root move visits. Hence the double counting. Anyway, if this theory is correct, a non-doubled node-count could be done at the point of summing the root visits, by subtracting the given hash visits: nodecount = oldnodecount minus hashhits.
So, why in this scenario does the double counting stop when the hash is full? Because the hash entries are still there, and you’d assume other hits from other nodes will still take place? Mystery.

Another thought, when backpropagating (wins,visits), if a node on the backprop path is in the hash table, presumably the hash entry should get updated at the same time.
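
The correction proposed above (nodecount = oldnodecount minus hashhits) can be written out as a toy calculation. This only restates the hypothesis in code: the visit counts are made up, the function name `corrected_node_count` is hypothetical, and nothing here reflects how lc0 actually counts nodes.

```python
def corrected_node_count(root_move_visits, hash_granted_visits):
    """Sum root-move visits, then subtract visits that were copied from the hash.

    Both arguments are hypothetical bookkeeping; nothing here is taken
    from the lc0 source.
    """
    old_node_count = sum(root_move_visits.values())
    return old_node_count - hash_granted_visits

visits = {"d2d4": 700, "e2e4": 250, "g1f3": 50}   # made-up visit counts
print(corrected_node_count(visits, hash_granted_visits=120))
```

Under this theory the search would need to keep a running total of visits granted from hash entries, which is the one extra piece of bookkeeping the correction requires.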
