LCZero: Progress and Scaling. Relation to CCRL Elo

Discussion of anything and everything relating to chess playing software and machines.

Moderators: bob, hgm, Harvey Williamson

JJJ
Posts: 1285
Joined: Sat Apr 19, 2014 11:47 am

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by JJJ » Thu May 17, 2018 1:09 pm

Lczero is still making progress in self-play, so I guess it is just a question of time until it gets better against other opponents as well.

noobpwnftw
Posts: 341
Joined: Sun Nov 08, 2015 10:10 pm

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by noobpwnftw » Thu May 17, 2018 2:18 pm

main line wrote:
Thu May 17, 2018 12:31 pm

What happens? Can Lczero beat a human?
A good human player would be somewhere around 2400-ish on an engine rating list, if not lower, so by that measure it passed them a while ago.

Albert Silver
Posts: 2837
Joined: Wed Mar 08, 2006 8:57 pm
Location: Rio de Janeiro, Brazil

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by Albert Silver » Thu May 17, 2018 2:40 pm

jkiliani wrote:
Thu May 17, 2018 11:51 am
Laskos wrote:
Thu May 17, 2018 7:10 am
Red lines are one standard deviation. There seems to have been an improvement, but I guess there are still critical bugs in their engine v0.10. They are very careless, adding 100+ commits since v0.7 without any proper testing.
They see 130 Elo points of progress in the last two data points; I see no progress at all. They don't seem to run regression tests, and are just comparing to the previous version with "freezing temperature", if I understood correctly. Never mind that these small "gains", taken successively, could be almost orthogonal, so all in all they would add up to nothing in a regression test.
Most recent commits are either changes to the lc0 implementation with multiple backends for neural net evaluation, bugfixes to original lczero, or diagnostic or server features. Commits that directly affect play are already handled much more conservatively now compared to a few weeks ago.

The discrepancies of self-play Elo to your testing could also stem from different methods: Afaik you test with opening books, is that correct? Self-play matches do not use a book, instead temperature (determining the chance to pick a move that did not receive the most visits) is used, mostly in the opening and much less later in game. That means that any new opening knowledge discovered, for instance which lines to prefer or to avoid, will be measured by self-play Elo but entirely missed by testing which uses a fixed book instead.
I can confirm the regression. And yes, I think using a small innocuous opening book is fine to ensure diversity of starting positions. We are talking about just a minimal opening position, not some decisive theory. Mind you, I am testing Leela against versions of herself. Here are the posted ratings of each NN I am testing, using NN223 as a baseline of zero:

Official ratings

NN223: 0
NN253: +241
NN303: +311

In my testing of hundreds of games at 1m+1s, Leela only playing herself, and only these three NN versions:

Real-life performance

NN223: 0
NN253: -8
NN303: -57

I have also watched the games, and NN303 is showing an astonishing number of bad evaluations that NN223 did not show. The most notable is king safety. NN303 seems completely oblivious to her king's safety and lets NN223 mount big attacks with no resistance. She will even declare she is doing better, while NN223 thinks she is already winning. I have seen this almost non-stop. Here is a sample screenshot from a match I am still running:

Image

NN223 is black, and NN303 is white.
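For reference, rating deltas like the ones above are just the standard logistic score-to-Elo conversion applied to a match result. A minimal sketch (the function name and the match numbers are made up for illustration):

```python
import math

def elo_diff(wins, draws, losses):
    """Elo difference implied by a match score, using the logistic model."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical 300-game match, slightly favouring the first engine:
# score = (110 + 45) / 300 ~ 0.517, which maps to roughly +12 Elo.
print(round(elo_diff(110, 90, 100)))
```

Note that a 57-point deficit over a few hundred games, as reported above, is well outside one standard deviation, so it is unlikely to be noise.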
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."

Milos
Posts: 3387
Joined: Wed Nov 25, 2009 12:47 am

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by Milos » Thu May 17, 2018 2:50 pm

Albert Silver wrote:
Thu May 17, 2018 2:40 pm
I can confirm the regression. And yes, I think using a small innocuous opening book is fine to ensure diversity of starting positions. We are talking about just a minimal opening position, not some decisive theory. Mind you, I am testing Leela against versions of herself. Here are the posted ratings of each NN I am testing, using NN223 as a baseline of zero:

Official ratings

NN223: 0
NN253: +241
NN303: +311

In my testing of hundreds of games at 1m+1s, Leela only playing herself, and only these three NN versions:

Real-life performance

NN223: 0
NN253: -8
NN303: -57

I have also watched the games, and NN303 is showing an astonishing number of bad evaluations that NN223 did not show. The most notable is king safety. NN303 seems completely oblivious to her king's safety and lets NN223 mount big attacks with no resistance. She will even declare she is doing better, while NN223 thinks she is already winning. I have seen this almost non-stop. Here is a sample screenshot from a match I am still running:
The model of testing just versus the last network to accept the new ones is not working (and it was clear from the very beginning that it's flawed) but no one listens, so what can one do?

User avatar
CMCanavessi
Posts: 835
Joined: Thu Dec 28, 2017 3:06 pm
Location: Argentina

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by CMCanavessi » Thu May 17, 2018 3:45 pm

Milos wrote:
Thu May 17, 2018 2:50 pm
The model of testing just versus the last network to accept the new ones is not working (and it was clear from the very beginning that it's flawed) but no one listens, so what can one do?
That's not how it works. EVERY network is accepted, even the ones that test negative. The self-play Elo is useless; that was already said ages ago :lol:
It's there just to check recent progress; you can't use it to compare the current network with one that is 50-70 networks old (and with several critical bugs in between).
Follow my tournament and some Leela gauntlets live at http://twitch.tv/ccls

User avatar
Laskos
Posts: 9414
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by Laskos » Thu May 17, 2018 4:14 pm

jkiliani wrote:
Thu May 17, 2018 11:51 am
Most recent commits are either changes to the lc0 implementation with multiple backends for neural net evaluation, bugfixes to original lczero, or diagnostic or server features. Commits that directly affect play are already handled much more conservatively now compared to a few weeks ago.

The discrepancies of self-play Elo to your testing could also stem from different methods: Afaik you test with opening books, is that correct? Self-play matches do not use a book, instead temperature (determining the chance to pick a move that did not receive the most visits) is used, mostly in the opening and much less later in game. That means that any new opening knowledge discovered, for instance which lines to prefer or to avoid, will be measured by self-play Elo but entirely missed by testing which uses a fixed book instead.
Yes, I use a very short 3-mover book of solid, balanced GM openings. I think LC0 must play well from all those short, solid lines, so this procedure is probably more sound than using no book at all in matches for determining strength (not in learning).
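The temperature mechanism described in the quoted post can be sketched roughly as follows; this is an illustrative approximation, not lc0's actual code, and the visit counts are invented:

```python
import random

def pick_move(visits, temperature):
    """Sample a move index from MCTS root visit counts, softened by temperature.

    As temperature approaches 0 this converges to always picking the
    most-visited move; temperature 1.0 samples proportionally to visits."""
    weights = [v ** (1.0 / temperature) for v in visits]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(range(len(visits)), weights=probs)[0]

# Hypothetical root visit counts for four candidate moves:
visits = [800, 150, 40, 10]
move = pick_move(visits, temperature=1.0)  # usually index 0, sometimes others
```

This is why self-play matches get opening diversity without a book: early moves are sampled with a nonzero temperature, while later moves are played near-greedily.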

Albert Silver
Posts: 2837
Joined: Wed Mar 08, 2006 8:57 pm
Location: Rio de Janeiro, Brazil

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by Albert Silver » Thu May 17, 2018 4:49 pm

Milos wrote:
Thu May 17, 2018 2:50 pm
The model of testing just versus the last network to accept the new ones is not working (and it was clear from the very beginning that it's flawed) but no one listens, so what can one do?
I spoke with several people in the Discord channels, and the problems are well acknowledged. There is no argument; the question now is when to do a rollback, and to what point. In fact some have even been experimenting with hybrid NN versions, implanting NN237 values (supposedly the last healthy NN) into NN302.

https://github.com/Neurodynasoft/LCZero-Tools
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."

noobpwnftw
Posts: 341
Joined: Sun Nov 08, 2015 10:10 pm

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by noobpwnftw » Thu May 17, 2018 4:58 pm

CMCanavessi wrote:
Thu May 17, 2018 3:45 pm
Milos wrote:
Thu May 17, 2018 2:50 pm
The model of testing just versus the last network to accept the new ones is not working (and it was clear from the very beginning that it's flawed) but no one listens, so what can one do?
That's not how it works. EVERY network is accepted, even the ones that are negative. The self-play elo is useless, was already said ages ago :lol:
It's there just to check the recent progress, you can't use it to compare current network with one that is 50-70 networks old (and with several critical bugs in the middle).
OK, then we all agree that the self-play Elo progression graph is utterly inflated, and that eventually it will reach a point where it cannot even show recent progress, because the changes will fall within large error bars. If it is useless as a measurement of performance, then you should probably change it into a graph of the daily number of games contributed, which would actually make sense for tracking the progress of the project.
How to correlate with CCRL ratings is not a technical problem but more of a PR issue: relative performance can be measured on any hardware configuration as long as it remains constant. Gaining 10 Elo under the same conditions versus the previous version says nothing about whether you are at 1500 or 3500; the graph just won't look very pretty.
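To make the error-bar point concrete: the one-sigma uncertainty on a measured Elo difference shrinks only with the square root of the number of games. A back-of-the-envelope sketch (the draw ratio is an assumption, and the function is purely illustrative):

```python
import math

def elo_std_error(games, draw_ratio=0.4):
    """Rough one-sigma error bar (in Elo) on a match score near 50%.

    Draws carry no variance, so a higher draw ratio tightens the estimate."""
    var = 0.25 * (1.0 - draw_ratio)          # per-game score variance at 50%
    se_score = math.sqrt(var / games)        # standard error of the mean score
    # Slope of the logistic Elo curve at 50%: 1600 / ln(10) Elo per unit score.
    return se_score * 1600.0 / math.log(10)

for n in (100, 400, 1600):
    print(n, round(elo_std_error(n), 1))
```

With a 40% draw ratio this gives roughly ±13 Elo after 400 games; a 10-Elo "gain" measured over a few hundred games is therefore barely distinguishable from zero.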
Last edited by noobpwnftw on Thu May 17, 2018 5:27 pm, edited 3 times in total.

noobpwnftw
Posts: 341
Joined: Sun Nov 08, 2015 10:10 pm

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by noobpwnftw » Thu May 17, 2018 5:12 pm

Albert Silver wrote:
Thu May 17, 2018 4:49 pm
I spoke with several in the Discord channels, and the problems are well acknowledged. There is no argument, and the question now is when to do a rollback, and to what end. In fact some have even been experimenting with hybrid NN versions, implanting NN237 values (the last healthy NN supposedly) into NN302.

https://github.com/Neurodynasoft/LCZero-Tools
The hybrid approach is just madness: mixing or implanting network weights from different generations is clearly a non-zero intervention, a.k.a. supervised training.

The proper way to do a rollback at this point is to revert the network, or to bootstrap anew from the pre-existing data, excluding the bug-affected games. In Leela Zero (the Go project) it was shown plausible to inject self-play games from a "future self" (played by the ELF weights), so a revert is probably more sound than a full bootstrap. Note that this only needs work on one machine capable of training the network, not necessarily "the one" that is running the training as we speak; it is not as if every LC0 contributor has to do coordinated work for a month to get it done. Doing nothing in this regard and letting people burn their electricity to "wait and see" is just laziness. Oh, and they asked for donations to buy a new training machine, no?

User avatar
mhull
Posts: 12379
Joined: Wed Mar 08, 2006 8:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: LCZero: Progress and Scaling. Relation to CCRL Elo

Post by mhull » Thu May 17, 2018 5:37 pm

Laskos wrote:
Thu May 17, 2018 4:14 pm
Yes, I use a very short 3-mover book of solid, balanced GM openings. I think LC0 must play well from all those short, solid lines, so this procedure is probably more sound than using no book at all in matches for determining strength (not in learning).
Would it be more sound to measure human strength that way too?
Matthew Hull

Post Reply