SF+NNUE 16 cores against Stockfish-dev 16 cores and Leela

Laskos · Post by **Laskos** » Fri Aug 14, 2020 8:59 pm

mehmet123 wrote: ↑Fri Aug 14, 2020 8:37 pm
Laskos wrote: ↑Fri Aug 14, 2020 3:04 pm
corres wrote: ↑Fri Aug 14, 2020 2:30 pm
M ANSARI wrote: ↑Fri Aug 14, 2020 11:55 am If this score holds then things look like SF will be ahead even on 8 core against 2 x 2080Ti Lc0. Not sure if your network is the strongest network for SF NN but things are progressing so fast that in 2 to 4 weeks there will probably be some fast gains for SF NN and then things will stabilize. By then it should be a good bit stronger and maybe on 8 core or even 4 cores it can go toe to toe with Lc0 at max hardware. I have a 2080Ti card in my system and it is huge! I can't imagine having 2 or 3 cards in my box !!! The thing is that 32 core is becoming more mainstream and a system with 2 x 2080Ti cards can be better compared to a 64 core system. With 64 cores I think Lc0 has no chance against SF NNUE no matter what the hardware ... unless a major improvement in Lc0 takes place.
The NVIDIA RTX 3090 is under development; this will be the major improvement for Leela.
It might also benefit from a new backend on the new CUDA and cuDNN. NNUE nets are only good, or best, for tactical games like Shogi and Chess, but in Go, AlphaZero-like NNs on GPU rule. KataGo is unbelievably strong on my RTX 2070 GPU; by now I can blunder-check any human Go game, Leela Zero Go games and AlphaGo games. Sometimes KataGo on the RTX 2070 with many playouts (say half a million) seems stronger than AlphaZero Go.
Is KataGo a commercial, private or free program? If it's commercial or private, which is the strongest free Go program?

It's completely free, works very well in GTP-supporting GUIs such as Lizzie and Sabaki, and is the strongest free program; probably no commercial or private program is stronger either (I am not aware of a stronger engine + network combination for PC). On GitHub:
https://github.com/lightvector/KataGo

mehmet123 · Post by **mehmet123** » Fri Aug 14, 2020 9:02 pm

Thanks for the information.

RogerC · Post by **RogerC** » Fri Aug 14, 2020 9:25 pm

corres wrote: ↑Fri Aug 14, 2020 11:02 am In my earlier post I reported my test, but as I see it, it escaped attention.
It is a pity, but the title of a post cannot be modified later, so I am obliged to repeat part of my test.

Test 2
NNUE-net = SV-2138
HASH = 2GB
Leela ver.0.25.1 Threads = 6, NNCache=20000000, Backend= Multiplexing,
BackendOptions=(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1),(backend=cudnn-fp16,gpu=2)
and using Kiudee params
Leela net = SV-384x30-t60-3010

Result 2
SF+NNUE PO 270720 popc 16 cores : kTRIAD = 14 : 6 (180 draw) 200 games
Elo difference ~25 Elo
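
The Elo figure above can be sanity-checked from the raw score. The following is a minimal sketch, assuming the standard logistic Elo model and a normal approximation for the error margin; the function name `elo_from_wdl` is hypothetical, and this is not necessarily how the poster arrived at ~25 Elo. Under these assumptions, 14 wins, 6 losses and 180 draws (a 52% score) work out to roughly 14 Elo, with a 95% interval wide enough to reach slightly below zero, so the ~25 figure may come from a different tool or model.

```python
import math

def elo_from_wdl(wins, losses, draws, z=1.96):
    """Return (elo, elo_low, elo_high) under the logistic Elo model (assumed)."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0 - score) ** 2) / n
    stderr = math.sqrt(var / n)

    def to_elo(s):
        # Logistic model: expected score s corresponds to this rating gap.
        return 400 * math.log10(s / (1 - s))

    return to_elo(score), to_elo(score - z * stderr), to_elo(score + z * stderr)

elo, lo, hi = elo_from_wdl(wins=14, losses=6, draws=180)
print(f"Elo {elo:+.1f}  (95% interval {lo:+.1f} .. {hi:+.1f})")
```

With a draw rate of 90%, 200 games leave a wide margin, so a longer match would be needed to pin the difference down.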

+25 Elo points for NNUE, an engine that consumes 50 watts of CPU power (that's what I measured on my PC with the NNUE engine) vs 200 watts (CPU + GPU)!

And the size difference between the nets is even more striking: 20 MB NNUE nets vs the 160 MB SV-384x30-t60-3010 LEELA net. NNUE nets are way more optimized! NNUE nets are just beginning to learn... 1 year of learning and NNUE nets will be unbeatable.

LEELA (and LC0) are "finished" and out of date.

M ANSARI · Post by **M ANSARI** » Sat Aug 15, 2020 4:42 am

RogerC wrote: ↑Fri Aug 14, 2020 9:25 pm
corres wrote: ↑Fri Aug 14, 2020 11:02 am In my earlier post I reported my test, but as I see it, it escaped attention.
It is a pity, but the title of a post cannot be modified later, so I am obliged to repeat part of my test.

Test 2
NNUE-net = SV-2138
HASH = 2GB
Leela ver.0.25.1 Threads = 6, NNCache=20000000, Backend= Multiplexing,
BackendOptions=(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1),(backend=cudnn-fp16,gpu=2)
and using Kiudee params
Leela net = SV-384x30-t60-3010

Result 2
SF+NNUE PO 270720 popc 16 cores : kTRIAD = 14 : 6 (180 draw) 200 games
Elo difference ~25 Elo
+25 Elo points for NNUE, an engine that consumes 50 watts of CPU power (that's what I measured on my PC with the NNUE engine) vs 200 watts (CPU + GPU)!

And the size difference between the nets is even more striking: 20 MB NNUE nets vs the 160 MB SV-384x30-t60-3010 LEELA net. NNUE nets are way more optimized! NNUE nets are just beginning to learn... 1 year of learning and NNUE nets will be unbeatable.

LEELA (and LC0) are "finished" and out of date.

Lol !!! I remember people saying that about AB engines not too long ago

Jhoravi · Post by **Jhoravi** » Sat Aug 15, 2020 5:15 am

RogerC wrote: ↑Fri Aug 14, 2020 9:25 pm And the size difference between the nets is even more striking: 20 MB NNUE nets vs the 160 MB SV-384x30-t60-3010 LEELA net.

How is the 20MB NNUE size decided? Is 20MB an optimal size, or can it be set bigger, like 100MB, when faster CPUs come?

RogerC · Post by **RogerC** » Sat Aug 15, 2020 5:49 am

Jhoravi wrote: ↑Sat Aug 15, 2020 5:15 am
How is the 20MB NNUE size decided? Is 20MB an optimal size, or can it be set bigger, like 100MB, when faster CPUs come?

Most of the net should be in CPU cache so that the CPU can evaluate millions of positions/sec, which is currently the case with the latest Intel or AMD processors. Leela nets, at 160 MB, will only be fast once processors have at least 100 MB of cache.
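
As a rough answer to where the ~20 MB comes from, here is a minimal sketch, assuming the nets in question use the HalfKP 256x2-32-32 layout of the Stockfish NNUE nets of that period (an assumption, not stated in the thread), with 16-bit weights in the first layer and 8-bit weights afterwards. Under that assumption the size follows from the architecture rather than from a cache budget, and nearly all of it sits in the first layer, of which a single evaluation reads only a few rows. All constant names are illustrative.

```python
# Hypothetical size estimate for a HalfKP 256x2-32-32 NNUE net (assumed layout).
KING_SQUARES = 64
PIECE_FEATURES = 10 * 64 + 1             # 10 non-king piece kinds x 64 squares, plus one padding index
INPUTS = KING_SQUARES * PIECE_FEATURES   # features per perspective
HALF = 256                               # accumulator width per perspective

ft_weights = INPUTS * HALF * 2           # first layer, int16 weights
ft_biases = HALF * 2                     # first layer, int16 biases
hidden1 = (2 * HALF) * 32 * 1 + 32 * 4   # int8 weights, int32 biases
hidden2 = 32 * 32 * 1 + 32 * 4
output = 32 * 1 * 1 + 1 * 4

total = ft_weights + ft_biases + hidden1 + hidden2 + output

# A full accumulator refresh sums one 256-wide row per active feature:
# at most 30 non-king pieces, seen from each of the two perspectives.
refresh_bytes = 2 * 30 * HALF * 2

print(f"first layer : {ft_weights / 2**20:.2f} MiB")
print(f"whole net   : {total / 2**20:.2f} MiB")
print(f"rows read on a full refresh: {refresh_bytes / 1024:.0f} KiB")
```

If this layout is the right one, a larger net is possible in principle by widening the layers, at the cost of more work per evaluated position.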

Jhoravi · Post by **Jhoravi** » Sat Aug 15, 2020 6:15 am

RogerC wrote: ↑Sat Aug 15, 2020 5:49 am Most of the net should be in CPU cache so that the CPU can evaluate millions of positions/sec, which is currently the case with the latest Intel or AMD processors. Leela nets, at 160 MB, will only be fast once processors have at least 100 MB of cache.

You mean the 20MB is already for a high-end PC? How about my laptop with only 8MB of L3 cache?

RogerC · Post by **RogerC** » Sat Aug 15, 2020 7:07 am

Jhoravi wrote: ↑Sat Aug 15, 2020 6:15 am
RogerC wrote: ↑Sat Aug 15, 2020 5:49 am Most of the net should be in CPU cache so that the CPU can evaluate millions of positions/sec, which is currently the case with the latest Intel or AMD processors. Leela nets, at 160 MB, will only be fast once processors have at least 100 MB of cache.
You mean the 20MB is already for a high-end PC? How about my laptop with only 8MB of L3 cache?

It will run much faster than Leela nets; on my Snapdragon 855 with ~3MB cache I get about 1-2 million positions/sec. So with 8MB cache it's fine. Evaluating positions with nets needs millions of round trips to RAM, so the more cache the processor has, the more of the heavy vector calculations it can get through.

Milos · Post by **Milos** » Mon Aug 17, 2020 1:44 am

corres wrote: ↑Fri Aug 14, 2020 5:20 pm
Werewolf wrote: ↑Fri Aug 14, 2020 3:50 pm
corres wrote: ↑Fri Aug 14, 2020 2:30 pm
M ANSARI wrote: ↑Fri Aug 14, 2020 11:55 am If this score holds then things look like SF will be ahead even on 8 core against 2 x 2080Ti Lc0. Not sure if your network is the strongest network for SF NN but things are progressing so fast that in 2 to 4 weeks there will probably be some fast gains for SF NN and then things will stabilize. By then it should be a good bit stronger and maybe on 8 core or even 4 cores it can go toe to toe with Lc0 at max hardware. I have a 2080Ti card in my system and it is huge! I can't imagine having 2 or 3 cards in my box !!! The thing is that 32 core is becoming more mainstream and a system with 2 x 2080Ti cards can be better compared to a 64 core system. With 64 cores I think Lc0 has no chance against SF NNUE no matter what the hardware ... unless a major improvement in Lc0 takes place.
The NVIDIA RTX 3090 is under development; this will be the major improvement for Leela.
But how much? Some estimates for Lc0 are as low as +19% speedup, which won't be enough.
NVIDIA develops not only new GPUs but also new software for their GPUs. RTX 2000 series GPUs yield a ~30% speedup using CUDA 11 and cuDNN 7.6.4.
Improving Leela's effectiveness with more than two GPUs is also a possibility.
In general, I think Leela's larger net gives a more exact evaluation than the small NNUE net.

On 2070 I get only 15% speedup of CUDA 11+cudnn 7.6.4 vs CUDA 10.2+cudnn 7.6.4. So no idea how you are getting +30%.
Regarding the 3090, based on current specs the speedup in classical GEMM operations will be minimal (~10%), and the speedup in tensor cores is only usable for CNNs with 7x7 convolutions in the input layer (or convolutions larger than 4x4), which doesn't give any benefit to inference of Leela nets.
Frankly, I'll be quite surprised if 3090 vs 2080Ti both running CUDA 11, brings more than 15%.
OTOH with upcoming Zen 3 chips speedup will be for sure more than that (like R9 4950X running at 5GHz).

corres · Post by **corres** » Mon Aug 17, 2020 10:28 am

Milos wrote: ↑Mon Aug 17, 2020 1:44 am
corres wrote: ↑Fri Aug 14, 2020 5:20 pm
Werewolf wrote: ↑Fri Aug 14, 2020 3:50 pm
corres wrote: ↑Fri Aug 14, 2020 2:30 pm
M ANSARI wrote: ↑Fri Aug 14, 2020 11:55 am If this score holds then things look like SF will be ahead even on 8 core against 2 x 2080Ti Lc0. Not sure if your network is the strongest network for SF NN but things are progressing so fast that in 2 to 4 weeks there will probably be some fast gains for SF NN and then things will stabilize. By then it should be a good bit stronger and maybe on 8 core or even 4 cores it can go toe to toe with Lc0 at max hardware. I have a 2080Ti card in my system and it is huge! I can't imagine having 2 or 3 cards in my box !!! The thing is that 32 core is becoming more mainstream and a system with 2 x 2080Ti cards can be better compared to a 64 core system. With 64 cores I think Lc0 has no chance against SF NNUE no matter what the hardware ... unless a major improvement in Lc0 takes place.
The NVIDIA RTX 3090 is under development; this will be the major improvement for Leela.
But how much? Some estimates for Lc0 are as low as +19% speedup, which won't be enough.
NVIDIA develops not only new GPUs but also new software for their GPUs. RTX 2000 series GPUs yield a ~30% speedup using CUDA 11 and cuDNN 7.6.4.
Improving Leela's effectiveness with more than two GPUs is also a possibility.
In general, I think Leela's larger net gives a more exact evaluation than the small NNUE net.
On 2070 I get only 15% speedup of CUDA 11+cudnn 7.6.4 vs CUDA 10.2+cudnn 7.6.4. So no idea how you are getting +30%.
Regarding the 3090, based on current specs the speedup in classical GEMM operations will be minimal (~10%), and the speedup in tensor cores is only usable for CNNs with 7x7 convolutions in the input layer (or convolutions larger than 4x4), which doesn't give any benefit to inference of Leela nets.
Frankly, I'll be quite surprised if 3090 vs 2080Ti both running CUDA 11, brings more than 15%.
OTOH with upcoming Zen 3 chips speedup will be for sure more than that (like R9 4950X running at 5GHz).

Maybe you use a small (20x256) net; in that case the speedup is only ~5-10%.
But with a big net (384x30) the speedup is ~30%. Try it again.
I compared the CUDA 11 + cuDNN 7.6.4 combination against CUDA 10.2 + cuDNN 7.4.2 (default Leela).
We can hope the Leela developers will optimize Leela's source for the NVIDIA 3000 series.
