LCZero: Progress and Scaling. Relation to CCRL Elo
Moderators: hgm, Rebel, chrisw
-
- Posts: 1346
- Joined: Sat Apr 19, 2014 1:47 pm
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
LCZero is still making progress in self-play, so I guess it is just a question of time until it gets better against other opponents as well.
-
- Posts: 560
- Joined: Sun Nov 08, 2015 11:10 pm
-
- Posts: 3019
- Joined: Wed Mar 08, 2006 9:57 pm
- Location: Rio de Janeiro, Brazil
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
Laskos wrote: ↑Thu May 17, 2018 9:10 am
Red lines are one standard deviation. There seem to have been an improvement, but I guess there are still critical bugs in their engine v0.10. They are very careless adding 100+ commits since v0.7, without any proper testing.
They see in the last 2 datapoints a 130 Elo points progress, I see no progress at all. They don't seem to run regression tests, and are just comparing to previous version with "freezing temperature", if I understood. Never mind that these small "gains" could be almost orthogonal taken successively, so all in all add to nothing in a regression test.
jkiliani wrote: ↑Thu May 17, 2018 1:51 pm
Most recent commits are either changes to the lc0 implementation with multiple backends for neural net evaluation, bugfixes to original lczero, or diagnostic or server features. Commits that directly affect play are already handled much more conservatively now compared to a few weeks ago.
The discrepancies of self-play Elo to your testing could also stem from different methods: Afaik you test with opening books, is that correct? Self-play matches do not use a book, instead temperature (determining the chance to pick a move that did not receive the most visits) is used, mostly in the opening and much less later in game. That means that any new opening knowledge discovered, for instance which lines to prefer or to avoid, will be measured by self-play Elo but entirely missed by testing which uses a fixed book instead.
I can confirm the regression. And yes, I think using a small innocuous opening book is fine to ensure diversity of starting positions. We are talking just a minimal opening position, not some decisive theory. Mind you, I am testing Leela against versions of herself. Here are the posted ratings of each NN I am testing, using NN223 as a baseline of zero:
Official ratings
NN223: 0
NN253: +241
NN303: +311
In my testing of hundreds of games at 1m+1s, Leela only playing herself, and only these three NN versions:
Real-life performance
NN223: 0
NN253: -8
NN303: -57
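To see how far apart the two tables are, a rating difference can be translated into an expected score. This is a minimal sketch assuming the standard logistic Elo model; the function name `expected_score` is illustrative, and the rating numbers are the ones posted above.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (0..1) of a player rated elo_diff above the opponent,
    under the standard logistic Elo model (an assumption, not stated in the post)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Ratings relative to NN223, copied from the two tables above.
official = {"NN253": 241, "NN303": 311}
measured = {"NN253": -8, "NN303": -57}

for net in official:
    print(
        f"{net}: official rating implies {expected_score(official[net]):.1%} vs NN223, "
        f"measured rating corresponds to {expected_score(measured[net]):.1%}"
    )
```

Under this model, +311 would mean scoring roughly 86% against NN223, while -57 corresponds to roughly 42%.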
I have also watched the games, and NN303 is showing an astonishing number of bad evaluations that NN223 did not show. The most notable is king safety. NN303 seems completely oblivious to her king's safety and lets NN223 mount big attacks with no resistance. She will even declare she is doing better, while NN223 thinks she is already winning. I have seen this almost non-stop. Here is a sample screenshot from a match I am still running:
NN223 is black, and NN303 is white.
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
The model of testing just versus the last network to accept the new ones is not working (and it was clear from the very beginning that it's flawed), but no one listens, so what can one do?
Albert Silver wrote: ↑Thu May 17, 2018 4:40 pm
I can confirm the regression. And yes, I think using a small innocuous opening book is fine to ensure diversity of starting positions. We are talking just a minimal opening position, not some decisive theory. Mind you, I am testing Leela against versions of herself. Here are the posted ratings of each NN I am testing, using NN223 as a baseline of zero:
jkiliani wrote: ↑Thu May 17, 2018 1:51 pm
Most recent commits are either changes to the lc0 implementation with multiple backends for neural net evaluation, bugfixes to original lczero, or diagnostic or server features. Commits that directly affect play are already handled much more conservatively now compared to a few weeks ago.
Laskos wrote: ↑Thu May 17, 2018 9:10 am
Red lines are one standard deviation. There seem to have been an improvement, but I guess there are still critical bugs in their engine v0.10. They are very careless adding 100+ commits since v0.7, without any proper testing.
-
- Posts: 1142
- Joined: Thu Dec 28, 2017 4:06 pm
- Location: Argentina
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
That's not how it works. EVERY network is accepted, even the ones that are negative. The self-play Elo is useless, as was already said ages ago.
It's there just to check the recent progress; you can't use it to compare the current network with one that is 50-70 networks old (and with several critical bugs in between).
Follow my tournament and some Leela gauntlets live at http://twitch.tv/ccls
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
Laskos wrote: ↑Thu May 17, 2018 9:10 am
Red lines are one standard deviation. There seem to have been an improvement, but I guess there are still critical bugs in their engine v0.10. They are very careless adding 100+ commits since v0.7, without any proper testing.
They see in the last 2 datapoints a 130 Elo points progress, I see no progress at all. They don't seem to run regression tests, and are just comparing to previous version with "freezing temperature", if I understood. Never mind that these small "gains" could be almost orthogonal taken successively, so all in all add to nothing in a regression test.
jkiliani wrote: ↑Thu May 17, 2018 1:51 pm
Most recent commits are either changes to the lc0 implementation with multiple backends for neural net evaluation, bugfixes to original lczero, or diagnostic or server features. Commits that directly affect play are already handled much more conservatively now compared to a few weeks ago.
The discrepancies of self-play Elo to your testing could also stem from different methods: Afaik you test with opening books, is that correct? Self-play matches do not use a book, instead temperature (determining the chance to pick a move that did not receive the most visits) is used, mostly in the opening and much less later in game. That means that any new opening knowledge discovered, for instance which lines to prefer or to avoid, will be measured by self-play Elo but entirely missed by testing which uses a fixed book instead.
Yes, I use a very short 3-mover book of solid, balanced GM openings. I think LC0 must play well from all those short, solid lines, so this procedure is probably more sound than using no book at all in matches for determining strength (not in learning).
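The temperature mechanism jkiliani describes in this thread can be sketched as follows. This is a minimal illustration, assuming the AlphaZero-style rule in which a move is sampled with probability proportional to its visit count raised to 1/T; the actual lczero code may differ, and the function name `pick_move` and the visit counts are hypothetical.

```python
import random


def pick_move(visits: dict[str, int], temperature: float, rng: random.Random) -> str:
    """Sample a move with probability proportional to visits ** (1 / temperature).

    Assumed AlphaZero-style rule; not taken from the lczero source.
    """
    if temperature <= 0:
        # Zero temperature: always play the most-visited move.
        return max(visits, key=visits.get)
    moves = list(visits)
    top = max(visits.values())
    # Normalise by the top count so small temperatures do not overflow.
    weights = [(visits[m] / top) ** (1.0 / temperature) for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]


rng = random.Random(0)
visits = {"e4": 500, "d4": 300, "c4": 200}  # hypothetical visit counts

for t in (1.0, 0.5, 0.0):
    sample = [pick_move(visits, t, rng) for _ in range(1000)]
    print(t, {m: sample.count(m) for m in visits})
```

At T = 1 the less-visited moves are still played fairly often, which is what provides opening diversity without a book; as T approaches zero the choice collapses onto the most-visited move.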
-
- Posts: 3019
- Joined: Wed Mar 08, 2006 9:57 pm
- Location: Rio de Janeiro, Brazil
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
I spoke with several people in the Discord channels, and the problems are well acknowledged. There is no argument, and the question now is when to do a rollback, and to what end. In fact some have even been experimenting with hybrid NN versions, implanting NN237 values (the last healthy NN supposedly) into NN302.
Milos wrote: ↑Thu May 17, 2018 4:50 pm
The model of testing just versus the last network to accept the new ones is not working (and it was clear from the very beginning that it's flawed), but no one listens, so what can one do?
Albert Silver wrote: ↑Thu May 17, 2018 4:40 pm
I can confirm the regression. And yes, I think using a small innocuous opening book is fine to ensure diversity of starting positions. We are talking just a minimal opening position, not some decisive theory. Mind you, I am testing Leela against versions of herself. Here are the posted ratings of each NN I am testing, using NN223 as a baseline of zero:
jkiliani wrote: ↑Thu May 17, 2018 1:51 pm
Most recent commits are either changes to the lc0 implementation with multiple backends for neural net evaluation, bugfixes to original lczero, or diagnostic or server features. Commits that directly affect play are already handled much more conservatively now compared to a few weeks ago.
The discrepancies of self-play Elo to your testing could also stem from different methods: Afaik you test with opening books, is that correct? Self-play matches do not use a book, instead temperature (determining the chance to pick a move that did not receive the most visits) is used, mostly in the opening and much less later in game. That means that any new opening knowledge discovered, for instance which lines to prefer or to avoid, will be measured by self-play Elo but entirely missed by testing which uses a fixed book instead.
https://github.com/Neurodynasoft/LCZero-Tools
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
-
- Posts: 560
- Joined: Sun Nov 08, 2015 11:10 pm
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
CMCanavessi wrote: ↑Thu May 17, 2018 5:45 pm
That's not how it works. EVERY network is accepted, even the ones that are negative. The self-play Elo is useless, as was already said ages ago.
It's there just to check the recent progress; you can't use it to compare the current network with one that is 50-70 networks old (and with several critical bugs in between).
OK, then we all agree that the self-play Elo progression graph is utterly inflated, and eventually it will reach a point where it cannot even show recent progress, because that progress will fall within large error bars. If it is useless as a measurement of performance, then you should probably replace it with a graph of the daily number of games contributed, which would make sense for tracking the progress of the project after all.
How to correlate with CCRL ratings is less a technical problem than a PR issue, because relative performance can be measured on any hardware configuration as long as it remains constant. Gaining 10 Elo under the same conditions versus the previous version does not depend on whether you are at 1500 or 3500; the graph just won't look very pretty.
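The error-bar point can be made concrete. If each network is only ever rated against its predecessor, the cumulative rating is a sum of separate match estimates, and its uncertainty grows roughly with the square root of the number of links. Below is a minimal sketch, assuming independent matches and the logistic Elo model; the function names and the game counts are hypothetical, not project data.

```python
import math


def elo_std_error(wins: int, draws: int, losses: int) -> float:
    """Approximate one-standard-deviation Elo error of a single match (delta method)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, where a game scores 1, 0.5 or 0.
    var = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    se_score = math.sqrt(var / n)
    # Derivative of Elo = 400 * log10(s / (1 - s)) with respect to the score s.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return slope * se_score


def chained_std_error(per_match_se: float, links: int) -> float:
    """Error of a rating obtained by summing `links` independent match estimates."""
    return per_match_se * math.sqrt(links)


# Hypothetical match between two consecutive networks: 400 games.
se = elo_std_error(wins=140, draws=140, losses=120)
print(f"single match: +/- {se:.1f} Elo")
for links in (1, 10, 60):
    print(f"{links:>2} chained matches: +/- {chained_std_error(se, links):.1f} Elo")
```

With these made-up numbers a single match carries an error of about 14 Elo, and chaining 60 such matches pushes the accumulated one-sigma error above 100 Elo, which is why comparing networks 50-70 generations apart on the self-play graph says little.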
Last edited by noobpwnftw on Thu May 17, 2018 7:27 pm, edited 3 times in total.
-
- Posts: 560
- Joined: Sun Nov 08, 2015 11:10 pm
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
The hybrid approach is just madness: mixing or implanting weights from different network generations is clearly non-zero intervention, aka supervised training.
Albert Silver wrote: ↑Thu May 17, 2018 6:49 pm
I spoke with several people in the Discord channels, and the problems are well acknowledged. There is no argument, and the question now is when to do a rollback, and to what end. In fact some have even been experimenting with hybrid NN versions, implanting NN237 values (the last healthy NN supposedly) into NN302.
Milos wrote: ↑Thu May 17, 2018 4:50 pm
The model of testing just versus the last network to accept the new ones is not working (and it was clear from the very beginning that it's flawed), but no one listens, so what can one do?
Albert Silver wrote: ↑Thu May 17, 2018 4:40 pm
I can confirm the regression. And yes, I think using a small innocuous opening book is fine to ensure diversity of starting positions. We are talking just a minimal opening position, not some decisive theory. Mind you, I am testing Leela against versions of herself. Here are the posted ratings of each NN I am testing, using NN223 as a baseline of zero:
https://github.com/Neurodynasoft/LCZero-Tools
The proper way to do a rollback at this point is to revert the network, or to bootstrap anew with the pre-existing data, excluding the bug-affected games. In LG0 it was shown to be plausible to inject self-play games from a future self (played by ELF weights), so it is probably more sound to do a revert than a full bootstrap. Note that this only needs work on one machine capable of network training, not even necessarily "the one" that is running the training as we speak; it is not as if every LC0 contributor has to collaborate to get it done in a month or so. Doing nothing in this regard, while letting people burn their electricity and "wait and see", is just laziness. Oh, and they asked for donations to buy a new training machine, no?
-
- Posts: 13447
- Joined: Wed Mar 08, 2006 9:02 pm
- Location: Dallas, Texas
- Full name: Matthew Hull
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
Would it be more sound to measure human strength that way too?
Matthew Hull