Distilled Networks for Lc0


dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

Werewolf wrote: Fri Jan 18, 2019 10:08 pm What do you mean by “distilled”?
You use one network to train another (usually smaller) network, rather than training it on self-play games or supervised learning data.
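
In code, the idea looks roughly like this. A minimal PyTorch-style sketch, where `teacher` and `student` are stand-ins for a large and a small network over the same inputs (illustrative only, not lc0's actual training code):

```python
# Minimal knowledge-distillation step: the student is trained to
# reproduce the teacher's outputs on a batch of positions, instead of
# learning from self-play results directly.
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, positions):
    with torch.no_grad():               # the teacher is fixed
        target = teacher(positions)     # soft targets from the big net
    pred = student(positions)
    # KL divergence pulls the student's distribution toward the teacher's.
    loss = F.kl_div(F.log_softmax(pred, dim=-1),
                    F.softmax(target, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
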
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

Werner wrote: Sun Jan 20, 2019 10:52 am Hi Dietrich,
thanks a lot for your work.

When I compare your list with our CEGT results I see a difference for Crafty. Does it run on 1 CPU?

1 ethereal : 3262 35 (CEGT 3187)
2 ID11258-112x9-se : 2965
3 crafty25.2 : 2948 26 (CEGT 2790)
Everything runs on 1 CPU.

Although on CCRL, Crafty 25.2 is at 3057.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Distilled Networks for Lc0

Post by corres »

dkappe wrote: Sat Jan 26, 2019 11:18 pm
Werewolf wrote: Fri Jan 18, 2019 10:08 pm What do you mean by “distilled”?
You use one network to train another (usually smaller) network, rather than training it on self-play games or supervised learning data.
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
AdminX
Posts: 6340
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: Distilled Networks for Lc0

Post by AdminX »

corres wrote: Sun Jan 27, 2019 10:16 am
dkappe wrote: Sat Jan 26, 2019 11:18 pm
Werewolf wrote: Fri Jan 18, 2019 10:08 pm What do you mean by “distilled”?
You use one network to train another (usually smaller) network, rather than training it on self-play games or supervised learning data.
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
A lot of his test results are posted on Discord (https://discordapp.com/), but there is a sample posted here: https://github.com/dkappe/leela-chess-w ... d-Networks
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Distilled Networks for Lc0

Post by corres »

AdminX wrote: Sun Jan 27, 2019 11:05 am
corres wrote: Sun Jan 27, 2019 10:16 am
dkappe wrote: Sat Jan 26, 2019 11:18 pm
Werewolf wrote: Fri Jan 18, 2019 10:08 pm What do you mean by “distilled”?
You use one network to train another (usually smaller) network, rather than training it on self-play games or supervised learning data.
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
A lot of his test results are posted on Discord (https://discordapp.com/), but there is a sample posted here: https://github.com/dkappe/leela-chess-w ... d-Networks
Thanks, but I would like to read Mr. Kappe's personal opinion.
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

corres wrote: Sun Jan 27, 2019 10:16 am
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
See the distilled network page for an extensive tournament testing various sizes. On GPU, the bigger the network, the stronger it plays. On CPU, it's a trade-off between speed and smarts.

https://github.com/dkappe/leela-chess-w ... d-Networks
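
For readers who want to reproduce the speed side of that trade-off themselves, here is a rough sketch assuming an lc0 binary on the PATH and locally downloaded weight files (the file names below are hypothetical, and the benchmark output format varies by lc0 version):

```python
# Rough speed comparison of differently sized nets using lc0's
# "benchmark" subcommand. Weight-file names are placeholders; we
# search the output for an "nps" (nodes per second) figure.
import re
import subprocess

WEIGHTS = ["distilled-small.pb.gz", "original-big.pb.gz"]  # hypothetical files

for w in WEIGHTS:
    out = subprocess.run(["lc0", "benchmark", f"--weights={w}"],
                         capture_output=True, text=True).stdout
    m = re.search(r"(\d[\d.]*)\s*nps", out)
    print(f"{w}: {m.group(1) + ' nps' if m else 'nps not found in output'}")
```
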
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Distilled Networks for Lc0

Post by corres »

dkappe wrote: Sun Jan 27, 2019 6:58 pm
corres wrote: Sun Jan 27, 2019 10:16 am
When you "distill" a network to get a smaller and faster NN it may lost some information.
What is your experience about it?
Did you make tests between original and "distilled" NN?
How faster the "distilled" NN-s are?
See the distilled network page for an extensive tournament testing various sizes. On GPU, the bigger the network, the stronger it plays. On CPU, it's a trade-off between speed and smarts.

https://github.com/dkappe/leela-chess-w ... d-Networks
OK, thanks.
But there is no data or opinion there about the information lost during the process.
I think that on GPUs, too, playing strength is a trade-off between speed and knowledge.
The recent 20x256 net size is not enough to saturate an RTX 2080 Ti.
Maybe a 40x256 could do it.
I think you know the structure of the Leela NN well.
A question: does the Leela NN that we can download contain the policy head/net too?
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

corres wrote: Mon Jan 28, 2019 12:04 am
OK, thanks.
But there is no data or opinion there about the information lost during the process.
I think that on GPUs, too, playing strength is a trade-off between speed and knowledge.
The recent 20x256 net size is not enough to saturate an RTX 2080 Ti.
Maybe a 40x256 could do it.
I think you know the structure of the Leela NN well.
A question: does the Leela NN that we can download contain the policy head/net too?
There’s plenty of opinion on knowledge distillation, and academic papers to boot. Here’s a layperson-appropriate article for starters: https://medium.com/neural-machines/know ... 241d7c2322
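
For reference, the standard soft-target loss from that literature (Hinton et al.) trains the student against a temperature-softened copy of the teacher's output:

$$
\mathcal{L} = \alpha \, T^{2} \, \mathrm{KL}\!\left( \sigma\!\left(\tfrac{z_{t}}{T}\right) \,\middle\|\, \sigma\!\left(\tfrac{z_{s}}{T}\right) \right) + (1 - \alpha) \, \mathrm{CE}\!\left( y, \sigma(z_{s}) \right)
$$

where \(z_t\) and \(z_s\) are the teacher's and student's logits, \(\sigma\) is softmax, \(T\) is the temperature, and \(\alpha\) balances imitating the teacher against fitting the hard labels \(y\).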

You’re right. There isn’t unlimited headroom on a GPU, but no one has trained a network big enough to find that spot yet.

All networks have value and policy heads; the lc0 search expects and requires them.
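
To make the "two heads" concrete, here is a schematic sketch of the shape lc0's search consumes: a shared trunk feeding a policy head and a value head. The layer sizes are illustrative only (the 112 input planes and 1858 policy outputs follow lc0's conventions, but this is not the real architecture):

```python
# Schematic two-headed network: one shared representation, two outputs.
# The policy head produces move logits; the value head a scalar in [-1, 1].
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    def __init__(self, planes=112, filters=128, moves=1858):
        super().__init__()
        self.trunk = nn.Sequential(                 # shared representation
            nn.Conv2d(planes, filters, 3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.policy = nn.Linear(filters * 64, moves)    # move logits
        self.value = nn.Sequential(                     # scalar evaluation
            nn.Linear(filters * 64, 1),
            nn.Tanh(),
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy(h), self.value(h)
```

A forward pass returns a (policy logits, value) pair, which is exactly the pair of outputs the search asks the network for at each node.
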
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Distilled Networks for Lc0

Post by corres »

dkappe wrote: Mon Jan 28, 2019 2:11 am There’s plenty of opinion on knowledge distillation, and academic papers to boot. Here’s a layperson-appropriate article for starters: https://medium.com/neural-machines/know ... 241d7c2322
OK, but these papers do not describe your own work, and the results depend on what you actually did. So I think an explanation of your work is needed.
dkappe wrote: Mon Jan 28, 2019 2:11 am You’re right. There isn’t unlimited headroom on a GPU, but no one has trained a network big enough to find that spot yet.
All networks have value and policy heads; the lc0 search expects and requires them.
Naturally LC0 has both heads. But there are NNs with separate value and policy heads, and there are NNs in which both are part of the same structure.
It is a pity that, although LC0 is an open project, its developers have given us only a desultory and defective write-up of LC0. Maybe they are following the precedent of the Google team?
If you can only "expect" something about the LC0 network, then who is the man who knows the truth?
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Distilled Networks for Lc0

Post by dkappe »

corres wrote: Mon Jan 28, 2019 9:00 am
OK, but these papers do not describe your own work, and the results depend on what you actually did. So I think an explanation of your work is needed.

Naturally LC0 has both heads. But there are NNs with separate value and policy heads, and there are NNs in which both are part of the same structure.
It is a pity that, although LC0 is an open project, its developers have given us only a desultory and defective write-up of LC0. Maybe they are following the precedent of the Google team?
If you can only "expect" something about the LC0 network, then who is the man who knows the truth?
I used the following branch of the lczero training code.

https://github.com/Ttl/lczero-training/tree/distill
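
For anyone wanting to experiment with that branch, the general shape of a run might look like the sketch below. Both the entry point (train.py with a --cfg flag) and the config filename are assumptions based on the upstream lczero-training layout, so check the branch's README first:

```python
# Sketch: drive the distill branch's training script from Python.
# The entry point and config name are assumptions; consult the
# repository's README for the real usage.
import subprocess

subprocess.run(
    ["python", "train.py", "--cfg", "configs/distill.yaml"],  # hypothetical config
    cwd="lczero-training",
    check=True,
)
```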

As far as explaining the code or the network architecture goes (beyond what’s already been written by the developers), I’m not the man.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".