Distilled Networks for Lc0


dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

Werewolf wrote: Fri Jan 18, 2019 10:08 pm What do you mean by “distilled”?
You use one network to train another (usually smaller) network, rather than training it on self-play games or supervised learning data.
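
In code, the idea looks roughly like this. A minimal PyTorch-style sketch, where `teacher` and `student` are stand-ins for a large and a small network over the same inputs (illustrative only, not lc0's actual training code):

```python
# Minimal knowledge-distillation step: the student is trained to
# reproduce the teacher's outputs on a batch of positions, instead of
# learning from self-play results directly.
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, positions):
    with torch.no_grad():               # the teacher is fixed
        target = teacher(positions)     # soft targets from the big net
    pred = student(positions)
    # KL divergence pulls the student's distribution toward the teacher's.
    loss = F.kl_div(F.log_softmax(pred, dim=-1),
                    F.softmax(target, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
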
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

Werner wrote: Sun Jan 20, 2019 10:52 am Hi Dietrich,
thanks a lot for your work.

When I compare your list with our CEGT results I see a difference for Crafty. Does it run on 1 CPU?

1 ethereal : 3262 35 (CEGT 3187)
2 ID11258-112x9-se : 2965
3 crafty25.2 : 2948 26 (CEGT 2790)
Everything runs on 1 CPU.

Although on CCRL, Crafty 25.2 is at 3057.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Distilled Networks for Lc0

Post by corres »

dkappe wrote: Sat Jan 26, 2019 11:18 pm
Werewolf wrote: Fri Jan 18, 2019 10:08 pm What do you mean by “distilled”?
You use one network to train another (usually smaller) network, rather than training it on self-play games or supervised learning data.
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
AdminX
Posts: 6340
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: Distilled Networks for Lc0

Post by AdminX »

corres wrote: Sun Jan 27, 2019 10:16 am
dkappe wrote: Sat Jan 26, 2019 11:18 pm
Werewolf wrote: Fri Jan 18, 2019 10:08 pm What do you mean by “distilled”?
You use one network to train another (usually smaller) network, rather than training it on self-play games or supervised learning data.
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
A lot of his test results are posted on Discord (https://discordapp.com/), but there is a sample posted here: https://github.com/dkappe/leela-chess-w ... d-Networks
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Distilled Networks for Lc0

Post by corres »

AdminX wrote: Sun Jan 27, 2019 11:05 am
corres wrote: Sun Jan 27, 2019 10:16 am
dkappe wrote: Sat Jan 26, 2019 11:18 pm
Werewolf wrote: Fri Jan 18, 2019 10:08 pm What do you mean by “distilled”?
You use one network to train another (usually smaller) network, rather than training it on self-play games or supervised learning data.
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
A lot of his test results are posted on Discord (https://discordapp.com/), but there is a sample posted here: https://github.com/dkappe/leela-chess-w ... d-Networks
Thanks, but I would like to read Mr. Kappe's personal opinion.
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

corres wrote: Sun Jan 27, 2019 10:16 am
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
See the distilled network page for an extensive tournament testing various sizes. On GPU, the bigger the network, the stronger it plays. On CPU, it's a trade-off between speed and smarts.

https://github.com/dkappe/leela-chess-w ... d-Networks
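
For readers who want to reproduce the speed side of that trade-off themselves, here is a rough sketch assuming an lc0 binary on the PATH and locally downloaded weight files (the file names below are hypothetical, and the benchmark output format varies by lc0 version):

```python
# Rough speed comparison of differently sized nets using lc0's
# "benchmark" subcommand. Weight-file names are placeholders; we
# search the output for an "nps" (nodes per second) figure.
import re
import subprocess

WEIGHTS = ["distilled-small.pb.gz", "original-big.pb.gz"]  # hypothetical files

for w in WEIGHTS:
    out = subprocess.run(["lc0", "benchmark", f"--weights={w}"],
                         capture_output=True, text=True).stdout
    m = re.search(r"(\d[\d.]*)\s*nps", out)
    print(f"{w}: {m.group(1) + ' nps' if m else 'nps not found in output'}")
```
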
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Distilled Networks for Lc0

Post by corres »

dkappe wrote: Sun Jan 27, 2019 6:58 pm
corres wrote: Sun Jan 27, 2019 10:16 am
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
See the distilled network page for an extensive tournament testing various sizes. On GPU, the bigger the network, the stronger it plays. On CPU, it's a trade-off between speed and smarts.

https://github.com/dkappe/leela-chess-w ... d-Networks
OK, thanks.
But there is no data or opinion there about the information lost during the process.
I think that on GPUs, too, playing strength is a trade-off between speed and knowledge.
The recent 20x256 net size is not enough to saturate an RTX 2080 Ti.
Maybe a 40x256 could do it.
I think you know the structure of the Leela NN well.
A question: does the Leela NN that we can download contain the policy head/net too?
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

corres wrote: Mon Jan 28, 2019 12:04 am
OK, thanks.
But there is no data or opinion there about the information lost during the process.
I think that on GPUs, too, playing strength is a trade-off between speed and knowledge.
The recent 20x256 net size is not enough to saturate an RTX 2080 Ti.
Maybe a 40x256 could do it.
I think you know the structure of the Leela NN well.
A question: does the Leela NN that we can download contain the policy head/net too?
There’s plenty of opinion on knowledge distillation, and academic papers to boot. Here’s a layperson-appropriate article for starters: https://medium.com/neural-machines/know ... 241d7c2322
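
For reference, the standard soft-target loss from that literature (Hinton et al.) trains the student against a temperature-softened copy of the teacher's output:

$$
\mathcal{L} = \alpha \, T^{2} \, \mathrm{KL}\!\left( \sigma\!\left(\tfrac{z_{t}}{T}\right) \,\middle\|\, \sigma\!\left(\tfrac{z_{s}}{T}\right) \right) + (1 - \alpha) \, \mathrm{CE}\!\left( y, \sigma(z_{s}) \right)
$$

where \(z_t\) and \(z_s\) are the teacher's and student's logits, \(\sigma\) is softmax, \(T\) is the temperature, and \(\alpha\) balances imitating the teacher against fitting the hard labels \(y\).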

You’re right. There isn’t unlimited headroom on a GPU, but no one has trained a network big enough to find that spot yet.

All networks have value and policy heads; the lc0 search expects and requires them.
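
To make the "two heads" concrete, here is a schematic sketch of the shape lc0's search consumes: a shared trunk feeding a policy head and a value head. The layer sizes are illustrative only (the 112 input planes and 1858 policy outputs follow lc0's conventions, but this is not the real architecture):

```python
# Schematic two-headed network: one shared representation, two outputs.
# The policy head produces move logits; the value head a scalar in [-1, 1].
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    def __init__(self, planes=112, filters=128, moves=1858):
        super().__init__()
        self.trunk = nn.Sequential(                 # shared representation
            nn.Conv2d(planes, filters, 3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.policy = nn.Linear(filters * 64, moves)    # move logits
        self.value = nn.Sequential(                     # scalar evaluation
            nn.Linear(filters * 64, 1),
            nn.Tanh(),
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy(h), self.value(h)
```

A forward pass returns a (policy logits, value) pair, which is exactly the pair of outputs the search asks the network for at each node.
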
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Distilled Networks for Lc0

Post by corres »

dkappe wrote: Mon Jan 28, 2019 2:11 am There’s plenty of opinion on knowledge distillation, and academic papers to boot. Here’s a layperson-appropriate article for starters: https://medium.com/neural-machines/know ... 241d7c2322
OK, but these papers do not describe your own work, and the results depend on what you actually did. So I think an explanation of your work is needed.
dkappe wrote: Mon Jan 28, 2019 2:11 am You’re right. There isn’t unlimited headroom on a GPU, but no one has trained a network big enough to find that spot yet.
All networks have value and policy heads; the lc0 search expects and requires them.
Naturally LC0 has both heads. But there are NNs with separate value and policy heads, and there are NNs in which both are part of the same structure.
It is a pity that, although LC0 is an open project, its developers have given us only a desultory and defective write-up of LC0. Maybe they are following the precedent of the Google team?
If you can only "expect" something about the LC0 network, then who is the man who knows the truth?
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

corres wrote: Mon Jan 28, 2019 9:00 am
OK, but these papers do not describe your own work, and the results depend on what you actually did. So I think an explanation of your work is needed.

Naturally LC0 has both heads. But there are NNs with separate value and policy heads, and there are NNs in which both are part of the same structure.
It is a pity that, although LC0 is an open project, its developers have given us only a desultory and defective write-up of LC0. Maybe they are following the precedent of the Google team?
If you can only "expect" something about the LC0 network, then who is the man who knows the truth?
I used the following branch of the lczero training code.

https://github.com/Ttl/lczero-training/tree/distill
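
For anyone wanting to experiment with that branch, the general shape of a run might look like the sketch below. Both the entry point (train.py with a --cfg flag) and the config filename are assumptions based on the upstream lczero-training layout, so check the branch's README first:

```python
# Sketch: drive the distill branch's training script from Python.
# The entry point and config name are assumptions; consult the
# repository's README for the real usage.
import subprocess

subprocess.run(
    ["python", "train.py", "--cfg", "configs/distill.yaml"],  # hypothetical config
    cwd="lczero-training",
    check=True,
)
```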

As far as explaining the code or the network architecture goes (beyond what’s already been written by the developers), I’m not the man.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".