Leela Chess is benefiting from much of the work done for Leela Go.
Net2net is one of the things used by Leela Go that I think has been used for Chess also.
The paper is here:
https://arxiv.org/abs/1511.05641
It looks like it uses the current smaller net weights to bootstrap the new larger net.
I think a training step using sample data is then still needed, but it learns faster instead of using random weights to start with.
You can search GitHub repos.
If you search the Leela Go repo for net2net there is much more than in the Leela Chess repo.
For more info than I could hope to provide, you could ask in the Leela Chess forum (url tags don't seem to work for this one):
https://groups.google.com/forum/#!forum/lczero
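Roughly, the paper's "Net2WiderNet" operation (copy weights from the smaller net, then split them so the wider net computes the same function at first) can be sketched like this. This is a hypothetical numpy illustration of the idea, not Leela's actual training code; the function name and shapes are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def net2wider(w1, w2, new_width):
    """Widen the hidden layer between w1 (in x h) and w2 (h x out)
    from h to new_width units, preserving the network's function."""
    h = w1.shape[1]
    # Mapping g: the first h new units copy themselves; the extra units
    # replicate randomly chosen existing units.
    g = np.concatenate([np.arange(h), rng.integers(0, h, new_width - h)])
    u1 = w1[:, g]                        # duplicate incoming weights
    counts = np.bincount(g, minlength=h) # how many copies of each old unit
    u2 = w2[g, :] / counts[g][:, None]   # split outgoing weights among copies
    return u1, u2

# Check: the widened net computes the same outputs as the original.
x = rng.normal(size=(4, 3))
w1 = rng.normal(size=(3, 5))
w2 = rng.normal(size=(5, 2))
u1, u2 = net2wider(w1, w2, 8)
assert np.allclose(x @ w1 @ w2, x @ u1 @ u2)
```

The division by `counts` is what makes the start exact: each duplicated unit contributes a proportional share of the original unit's outgoing weight, so training can then push the copies apart.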
MCTS beginner questions
Moderator: Ras
-
brianr
- Posts: 540
- Joined: Thu Mar 09, 2006 3:01 pm
- Full name: Brian Richardson
-
Gian-Carlo Pascutto
- Posts: 1260
- Joined: Sat Dec 13, 2008 7:00 pm
Re: MCTS beginner questions
We've used both net2net and retraining on all data for upgrading the network size. In theory net2net should work better, or at least faster.
-
jp
- Posts: 1490
- Joined: Mon Apr 23, 2018 7:54 am
Re: MCTS beginner questions
Gian-Carlo Pascutto wrote: We've used both net2net and retraining on all data for upgrading the network size. In theory net2net should work better, or at least faster.
Thanks. How long does retraining on all data take? I guess that means the vast majority of time is taken in creating the data (games), not the training once you have the data. Is that the case very generally?
-
Gian-Carlo Pascutto
- Posts: 1260
- Joined: Sat Dec 13, 2008 7:00 pm
Re: MCTS beginner questions
jp wrote: Thanks. How long does retraining on all data take? I guess that means the vast majority of time is taken in creating the data (games), not the training once you have the data. Is that the case very generally?
I'm not sure what you mean. If you're upgrading the network, you already have the games. For 192x15 it took me about a week on a GTX 1070. At this point the training hasn't fully converged, but the new network is stronger than the smaller one by some margin, and that's good enough to inject it into the cycle.
The whole Zero process is totally bottlenecked on producing the games. That's why it's possible to do it in a distributed manner.
-
jp
- Posts: 1490
- Joined: Mon Apr 23, 2018 7:54 am
Re: MCTS beginner questions
Gian-Carlo Pascutto wrote: I'm not sure what you mean. If you're upgrading the network, you already have the games. For 192x15 it took me about a week on a GTX 1070. At this point the training hasn't fully converged, but the new network is stronger than the smaller one by some margin, and that's good enough to inject it into the cycle.
The whole Zero process is totally bottlenecked on producing the games. That's why it's possible to do it in a distributed manner.
Yes, that's what I was asking for a numerical figure on (but maybe that's not so easy to say & not so important). I mean, the week you took for 192x15 was once the distributed computing had generated all the games, right? I just wondered how much time (in units of GTX 1070 equivalent) it took to generate those games (e.g. 999 GTX 1070 weeks would mean a 99.9% bottleneck).
In general, I'm just trying to get an idea of the final performance after training and upgrading and repeating over and over vs. starting with the largest NN size (for the same total computer time). Does the former really give a better final performance? Of course, you get more games more quickly to begin with, but in some sense those games might be "lower quality" (or maybe not, as long as the small NN is not saturated??).
One advantage I see of starting with a small NN and upgrading is that debugging should be quicker & you'll get a bug-free product sooner.
-
Gian-Carlo Pascutto
- Posts: 1260
- Joined: Sat Dec 13, 2008 7:00 pm
Re: MCTS beginner questions
jp wrote: Does the former really give a better final performance?
I don't worry too much about final performance being different, but it's true that at the end you may take more resources to reach the final performance.
But you have more chances of getting there. Not only due to faster debugging, but also because it's easier to get people to join the project if you make good progress.
Also, I consider it something of a no-brainer that the first batch of random games might as well be done with a smaller network, so surely the optimum in compute terms must lie somewhere in the middle, right?
-
jp
- Posts: 1490
- Joined: Mon Apr 23, 2018 7:54 am
Re: MCTS beginner questions
Yes, that sounds reasonable. There presumably exists some unknown optimal NN upgrade schedule.
-
jp
- Posts: 1490
- Joined: Mon Apr 23, 2018 7:54 am
Re: MCTS beginner questions
brianr wrote: Leela Chess is benefiting from much of the work done for Leela Go.
Net2net is one of the things used by Leela Go that I think has been used for Chess also.
The paper is here:
https://arxiv.org/abs/1511.05641
It looks like it uses the current smaller net weights to bootstrap the new larger net.
I think a training step using sample data is then still needed, but it learns faster instead of using random weights to start with.
...
Thanks for the reference.
Alexander Lyashuk also wrote the following (re. net2net):
"Increasing number of blocks:
by inserting blocks which are initialized in such way that they do nothing (except small noise), so input = output.
Increasing number of filters in a block:
(roughly speaking) by splitting existing nodes into several. E.g. of x = 5y was one node, it's replaced with x = 2y + 3y (two nodes).
(this is overly simplified and may be wrong)."
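The first of those two tricks, inserting a layer initialized as the identity so that input = output, can be sketched as follows. This is a hypothetical numpy illustration of the deepening idea, not Leela's actual code; the function name, sizes, and noise scale are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def identity_layer(width, noise=1e-6):
    """Weights for a newly inserted fully-connected layer that initially
    computes the identity, plus tiny symmetry-breaking noise, so the
    deepened network's outputs are (almost) unchanged at first."""
    return np.eye(width) + rng.normal(scale=noise, size=(width, width))

x = rng.normal(size=(4, 16))   # a batch of activations entering the new layer
w_new = identity_layer(16)
# The new layer passes activations through essentially unchanged.
assert np.allclose(x @ w_new, x, atol=1e-4)
```

The small noise term is there so the new layer's units are not exactly interchangeable, which would otherwise keep their gradients identical during subsequent training.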