LCZero: Progress and Scaling. Relation to CCRL Elo

Guenther · Post by **Guenther** » Thu May 17, 2018 8:07 am

David Xu wrote: ↑Thu May 17, 2018 3:45 am
jp wrote:
I have a question for you: do you have any idea what you're talking about when you comment in these threads?

You can add him to your ignore (foe) list. I have done this very soon after his first posts.

mar · Post by **mar** » Thu May 17, 2018 8:20 am

Laskos wrote: ↑Thu May 17, 2018 12:22 am Yes, the same for ID302 compared to ID292, no improvement (well, within error margins, so there is maybe at most 20 Elo points improvement). They see 130 Elo points improvement in self-games. Either something is wrong with my testing, or again something is fishy in their framework.

Well I'm playing some test games with ID303 and so far (20 games played) it seems not 100 elo stronger than ID 24x I played last time, but rather 100 elo weaker....
Still too early to draw conclusions, but 25% after 20 games when I expected Leela to be on par with Cheng according to their elo graph, so far a disappointment.
Note that I'm using 40 moves in 2 min now so the TC should be better for Leela than 40/1min I played before (note it's still the official OpenCL-based engine).
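
To put "25% after 20 games" in perspective, here is a minimal sketch that converts a score fraction into an Elo difference with the usual logistic formula and attaches a rough error interval. The function names are made up for illustration, and the interval treats every game as a plain win or loss, so with draws in the mix it overstates the spread somewhat.

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction (logistic Elo model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_interval(score, games, z=1.96):
    """Normal-approximation interval on the score fraction.

    Assumption: games are independent win/loss trials (draws ignored),
    which makes the interval a little wider than it really is.
    """
    se = math.sqrt(score * (1.0 - score) / games)
    # Clamp so elo_diff() never sees exactly 0 or 1.
    return max(score - z * se, 1e-6), min(score + z * se, 1.0 - 1e-6)

score, games = 0.25, 20
low, high = score_interval(score, games)
print(f"point estimate: {elo_diff(score):+.0f} Elo")
print(f"rough 95% range: {elo_diff(low):+.0f} to {elo_diff(high):+.0f} Elo")
```

The range that comes out spans several hundred Elo, which is why 20 games can only hint at a direction.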

What exactly does their elo graph show anyway? Do they run regression tests from time to time or is it just delta from the previous version?
If so then that's pretty much random and useless if improvements are small.

Anyway, always the same story with Leela: blundering random moves like crazy,
losing to shallow tactics. I even saw Leela blunder twice in a single game, first throwing away a win then wasting a draw
- no way they can compete with the top dogs with this approach on consumer HW (not to mention that current SF should be on par with A0 elo-wise on Google HW).

I plan to play 200 games to get a rough idea of how strong the current engine + net is, I'll post the results here.

Laskos · Post by **Laskos** » Thu May 17, 2018 9:10 am

mar wrote: ↑Thu May 17, 2018 8:20 am
Laskos wrote: ↑Thu May 17, 2018 12:22 am Yes, the same for ID302 compared to ID292, no improvement (well, within error margins, so there is maybe at most 20 Elo points improvement). They see 130 Elo points improvement in self-games. Either something is wrong with my testing, or again something is fishy in their framework.
Well I'm playing some test games with ID303 and so far (20 games played) it seems not 100 elo stronger than ID 24x I played last time, but rather 100 elo weaker....
Still too early to draw conclusions, but 25% after 20 games when I expected Leela to be on par with Cheng according to their elo graph, so far a disappointment.
Note that I'm using 40 moves in 2 min now so the TC should be better for Leela than 40/1min I played before (note it's still the official OpenCL-based engine).

What exactly does their elo graph show anyway? Do they run regression tests from time to time or is it just delta from the previous version?
If so then that's pretty much random and useless if improvements are small.

Anyway, always the same story with Leela: blundering random moves like crazy,
losing to shallow tactics. I even saw Leela blunder twice in a single game, first throwing away a win then wasting a draw
- no way they can compete with the top dogs with this approach on consumer HW (not to mention that current SF should be on par with A0 elo-wise on Google HW).

I plan to play 200 games to get a rough idea of how strong the current engine + net is, I'll post the results here.

So, some sort of confirmation, both on overall strength and on tactics. I was wondering about the validity of my test against one AB engine, LC0 being on CPU and with pretty low number of playouts. The Elo graph is here, since the first bigger net ID227:

Red lines are one standard deviation. There seems to have been an improvement, but I guess there are still critical bugs in their engine v0.10. They are very careless, adding 100+ commits since v0.7 without any proper testing.
Over the last 2 data points they see 130 Elo points of progress; I see no progress at all. They don't seem to run regression tests, and are just comparing to the previous version with "freezing temperature", if I understood correctly. Never mind that these small "gains", taken successively, could be almost orthogonal, and so add up to nothing in a regression test.

We are in agreement also on easy tactics: it is worse now with ID302 than with the initial ID227. I used Albert's cleaned WAC201.epd tactical suite

6s/position on 4 CPU threads, equivalent to 1s/position on GTX 1060:

ID227
score=84/201 [averages on correct positions: depth=11.1 time=0.96 nodes=178]

ID302
score=74/201 [averages on correct positions: depth=11.3 time=1.21 nodes=190]

So, even if it has gained Elo points since ID227, easy tactics are even worse. I think they have to roll back to a less buggy engine, say v0.7 and older nets, and then accept commits only after severe vetting and testing (more or less the SF framework).
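
As a rough check on how much weight 84/201 versus 74/201 can carry, the sketch below runs a pooled two-proportion z-test on the two suite scores. This is an approximation under a stated assumption: both nets solved the same 201 positions, so a paired test would be more appropriate, but that needs the per-position results, which are not posted here. The function name is hypothetical.

```python
import math

def two_proportion_z(solved_a, solved_b, total):
    """Pooled z statistic for two solve rates over `total` positions each.

    Assumption: the two runs are treated as independent samples, although
    they actually share the same positions.
    """
    p_a = solved_a / total
    p_b = solved_b / total
    pooled = (solved_a + solved_b) / (2 * total)
    se = math.sqrt(pooled * (1.0 - pooled) * 2.0 / total)
    return (p_a - p_b) / se

z = two_proportion_z(84, 74, 201)  # ID227 vs ID302
print(f"z = {z:.2f}")
```

The statistic comes out at about one standard error, so under this approximation the suite shows no clear gain in easy tactics rather than a firmly established loss.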

Guenther · Post by **Guenther** » Thu May 17, 2018 10:44 am

I have been running my own tests since April. ID 303 is currently being tested (ID 100 will be added later).
A Google spreadsheet with details, graph, and conditions (plus games) is in preparation but not yet ready for publishing.

Each LCZero version always plays 10*30 games vs. the same 10 opponents.
Each 30-game batch is played from openings picked at random from a small (~1200) 3-move PGN, with reversed colours.

TC is always 5+5 vs. 2+1, thus time odds of around 3.5-4.5:1 in favour of LCZero,
to mimic a better GPU. The GPU currently used is very weak but not old (with the current net size, around 70-80 nps).
Actually I bought it for around 30€ because it is passively cooled (no fan) and thus absolutely silent.
(The one before slowly died, sometimes with a hell of a noise, due to a damaged fan.)
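
The effective time odds of 5+5 against 2+1 depend on how long a game lasts, since the increments differ as well as the base times. A small sketch, assuming "5+5" means 5 minutes plus 5 seconds per move, "2+1" means 2 minutes plus 1 second per move, and both sides use all of their time (the helper names are made up):

```python
def total_time(base_s, inc_s, moves):
    """Seconds available to one side over `moves` moves."""
    return base_s + inc_s * moves

def time_odds(moves):
    """Ratio of LCZero's time (5+5) to the opponent's time (2+1)."""
    return total_time(300, 5, moves) / total_time(120, 1, moves)

for moves in (40, 80, 120, 200):
    print(f"{moves:3d} moves: {time_odds(moves):.2f}:1")
```

Under these assumptions the ratio grows with game length, starting from 2.5:1 on base time alone and approaching 5:1 in very long games.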

Below is the current result, calibrated to CCRL 40/4 and calculated with Ordo using 400 simulations.

A little note:
Counter 1.2-64 is still, and always was, outside the error window.
This was the reason why I asked whether it is 32 or 64 bit in CCRL.
(The result was that it can run both ways, but this is not
distinguished in the ratings.)
viewtopic.php?f=2&t=67250

This means the CCRL rating for Counter_12-64 should likely be a bit higher,
and this would shift the ratings of all LCZero entries a bit higher in comparison.

#       PLAYER          RATING   ERROR  POINTS  PLAYED  (%)     CCRL 40/4(1)    CCRL 40/40(2)   Diff 1  Diff 2
1       Chronos_197     2631.48  49.51   73.0   150     48.7    2639            2639            -7.52    -7.52
2       Counter_12-64   2505.12  50.05   49.5   150     33.0    2446            2468            59.12    37.12
3       Danasah_70      2592.51  49.77   65.5   150     43.7    2596            2611            -3.49   -18.49
4       Glaurung_201-64 2720.07  51.65   90.0   150     60.0    2740            2745           -19.93   -24.93
5       Hermann_25-64   2510.87  50.68   50.5   150     33.7    2512            2496            -1.13    14.87
6       Jellyfish_11-64 2628.90  49.21   72.5   150     48.3    2608            2577            20.90    51.90
7       LCZero_07ID125  2509.54* 34.74  109.5   300     36.5    *               *                *       *
8       LCZero_07ID150  2518.65* 35.74  113.0   300     37.7    *               *                *       *
9       LCZero_07ID181  2669.39* 35.10  174.0   300     58.0    *               *                *       *
10      LCZero_07ID231  2740.88* 35.92  201.5   300     67.2    *               *                *       *
11      LCZero_010ID254 2767.63* 36.07  211.0   300     70.3    *               *                *       *
12      LCZero_010ID303 *        *      *       *       *       *               *                *       *
13      Monolith_04-64  2574.05  48.82   62.0   150     41.3    2597            2591            -22.95  -16.95
14      Rodent_10-64    2683.21  48.12   83.0   150     55.3    2692            2677             -8.79    6.21
15      Rotor_08        2613.37  45.95   69.5   150     46.3    2612            2628              1.37  -14.63
16      Tucano_400-64   2644.39  48.60   75.5   150     50.3    2662            2664            -17.61  -19.61
---------------------------------------------------------------------------------------------------------------
Gauntlet Opp Rating     2610.40                                 2610.40         2609.60           0.00    0.80
                        avg                                     adapted avg     avg               avg     avg
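
The averages and the "Diff 1" column of the table can be re-derived from the opponents' rows. The sketch below does that and flags any opponent whose calibrated rating differs from its CCRL 40/4 rating by more than its ERROR value. The numbers are copied from the table; treating ERROR as the half-width of the "error window" is an assumption.

```python
# name: (calibrated rating, error, CCRL 40/4, CCRL 40/40), copied from the table
OPPONENTS = {
    "Chronos_197":     (2631.48, 49.51, 2639, 2639),
    "Counter_12-64":   (2505.12, 50.05, 2446, 2468),
    "Danasah_70":      (2592.51, 49.77, 2596, 2611),
    "Glaurung_201-64": (2720.07, 51.65, 2740, 2745),
    "Hermann_25-64":   (2510.87, 50.68, 2512, 2496),
    "Jellyfish_11-64": (2628.90, 49.21, 2608, 2577),
    "Monolith_04-64":  (2574.05, 48.82, 2597, 2591),
    "Rodent_10-64":    (2683.21, 48.12, 2692, 2677),
    "Rotor_08":        (2613.37, 45.95, 2612, 2628),
    "Tucano_400-64":   (2644.39, 48.60, 2662, 2664),
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

gauntlet_avg = mean(row[0] for row in OPPONENTS.values())
ccrl_404_avg = mean(row[2] for row in OPPONENTS.values())
ccrl_4040_avg = mean(row[3] for row in OPPONENTS.values())

# "Diff 1": calibrated rating minus CCRL 40/4 rating.
diff1 = {name: row[0] - row[2] for name, row in OPPONENTS.items()}

# Opponents whose Diff 1 exceeds their own error value.
outside = [name for name, row in OPPONENTS.items() if abs(diff1[name]) > row[1]]

print(f"gauntlet avg {gauntlet_avg:.2f}, CCRL 40/4 avg {ccrl_404_avg:.2f}, "
      f"CCRL 40/40 avg {ccrl_4040_avg:.2f}")
print("outside error window vs CCRL 40/4:", outside)
```

Against the CCRL 40/4 column, only Counter_12-64 falls outside its error value, which matches the note above the table.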

Guenther · Post by **Guenther** » Thu May 17, 2018 10:52 am

mar wrote: ↑Thu May 17, 2018 8:20 am
What exactly does their elo graph show anyway? Do they run regression tests from time to time or is it just delta from the previous version?
If so then that's pretty much random and useless if improvements are small.

They did very few regression tests in the past and only one lately (233 vs. 292).
http://lczero.org/matches

Anyhow, as you have noticed and as has long been mentioned, the self-play (SP) ratings are quite meaningless for various reasons.

Guenther

jp · Post by jp » Thu May 17, 2018 11:24 am

David Xu wrote: ↑Thu May 17, 2018 3:45 am
jp wrote: ↑Wed May 16, 2018 7:56 pm
yanquis1972 wrote: ↑Wed May 16, 2018 7:07 pm

I have a question for you: do you have any idea what you're talking about when you comment in these threads?

I have to check. David Xu, are you asking me that question? If you're not, please ignore the following.

If you are, I don't know why you resort to personal attacks.
Do you realize that once someone else wrongly grouped me with you and attacked me for your views? I didn't see you reply then to attack him or to tell him it was your views he was attacking, not mine. Another person attacked you then, again without you responding. Why not?
I have never attacked you. I have never attacked anyone here.
So you decide you have to butt in to a conversation with yanquis1972 and Albert and attack me?

May I ask what your special qualifications are?

You appear to be extremely intolerant of anyone saying anything you don't like, even if they are not speaking to you and even if they don't know you don't like what they say.

Milos · Post by **Milos** » Thu May 17, 2018 11:56 am

Guenther wrote: ↑Thu May 17, 2018 8:07 am
David Xu wrote: ↑Thu May 17, 2018 3:45 am
jp wrote:
I have a question for you: do you have any idea what you're talking about when you comment in these threads?
You can add him to your ignore (foe) list. I have done this very soon after his first posts.

You tell a typical anonymous troll to add someone to ignore list, gee.
That David Xu guy posted in total 40 posts on this forum out of which 38 are oneliners, and in most of those he just calls ppl names, stalks them, and posts meaningless BS. He is someone that is the best recommendation for ignore list.
Your judgement of ppl is problematic at best.

noobpwnftw · Post by **noobpwnftw** » Thu May 17, 2018 12:18 pm

Milos wrote: ↑Thu May 17, 2018 11:56 am
Guenther wrote: ↑Thu May 17, 2018 8:07 am
David Xu wrote: ↑Thu May 17, 2018 3:45 am
I have a question for you: do you have any idea what you're talking about when you comment in these threads?
You can add him to your ignore (foe) list. I have done this very soon after his first posts.
You tell a typical anonymous troll to add someone to ignore list, gee.
That David Xu guy posted in total 40 posts on this forum out of which 38 are oneliners, and in most of those he just calls ppl names, stalks them, and posts meaningless BS. He is someone that is the best recommendation for ignore list.
Your judgement of ppl is problematic at best.

As far as I can tell, the last claim he made against me was that performance saturation was nowhere close. Now that reality has arrived, should I enjoy returning the favor because "facts don't care about your feelings"?

Back to the topic: LC0 is doing okay, but it seems unlikely to gain another +400 real-world Elo on average hardware just by tossing more games into it. Zuck's team probably demonstrated that with a reasonable amount of hardware in the NN realm.

jkiliani · Post by **jkiliani** » Thu May 17, 2018 1:51 pm

Laskos wrote: ↑Thu May 17, 2018 9:10 am Red lines are one standard deviation. There seems to have been an improvement, but I guess there are still critical bugs in their engine v0.10. They are very careless, adding 100+ commits since v0.7 without any proper testing.
Over the last 2 data points they see 130 Elo points of progress; I see no progress at all. They don't seem to run regression tests, and are just comparing to the previous version with "freezing temperature", if I understood correctly. Never mind that these small "gains", taken successively, could be almost orthogonal, and so add up to nothing in a regression test.

Most recent commits are either changes to the lc0 implementation with multiple backends for neural net evaluation, bugfixes to original lczero, or diagnostic or server features. Commits that directly affect play are already handled much more conservatively now compared to a few weeks ago.

The discrepancies of self-play Elo to your testing could also stem from different methods: Afaik you test with opening books, is that correct? Self-play matches do not use a book, instead temperature (determining the chance to pick a move that did not receive the most visits) is used, mostly in the opening and much less later in game. That means that any new opening knowledge discovered, for instance which lines to prefer or to avoid, will be measured by self-play Elo but entirely missed by testing which uses a fixed book instead.
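
For readers unfamiliar with temperature, here is a minimal sketch of the AlphaZero-style scheme, in which a move is sampled with probability proportional to its visit count raised to 1/T. It is only an illustration: the function names and visit counts are made up, and the actual lczero code and its temperature schedule are not shown in this thread.

```python
import random

def move_probabilities(visits, temperature):
    """Map visit counts to selection probabilities, p_i ~ N_i ** (1 / T)."""
    best = max(visits.values())
    if temperature <= 0:
        # Limit T -> 0: always play the most-visited move (ties split evenly).
        winners = [move for move, n in visits.items() if n == best]
        return {move: (1.0 / len(winners) if move in winners else 0.0)
                for move in visits}
    # Divide by the best count first so small T cannot overflow the power.
    weights = {move: (n / best) ** (1.0 / temperature)
               for move, n in visits.items()}
    total = sum(weights.values())
    return {move: w / total for move, w in weights.items()}

def pick_move(visits, temperature, rng=random):
    """Sample one move according to the temperature-adjusted probabilities."""
    probs = move_probabilities(visits, temperature)
    moves = list(probs)
    return rng.choices(moves, weights=[probs[m] for m in moves], k=1)[0]

# Hypothetical root visit counts after a search.
example_visits = {"e2e4": 600, "d2d4": 300, "g1f3": 100}
print(move_probabilities(example_visits, 1.0))  # proportional to visits
print(move_probabilities(example_visits, 0.0))  # greedy choice
```

With T = 1 the less-visited moves still get played a fair share of the time, which provides opening variety without a book; as T approaches 0 the choice becomes deterministic.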

main line · Post by **main line** » Thu May 17, 2018 2:31 pm

noobpwnftw wrote: ↑Thu May 17, 2018 12:18 pm
Milos wrote: ↑Thu May 17, 2018 11:56 am
Guenther wrote: ↑Thu May 17, 2018 8:07 am

You can add him to your ignore (foe) list. I have done this very soon after his first posts.
You tell a typical anonymous troll to add someone to ignore list, gee.
That David Xu guy posted in total 40 posts on this forum out of which 38 are oneliners, and in most of those he just calls ppl names, stalks them, and posts meaningless BS. He is someone that is the best recommendation for ignore list.
Your judgement of ppl is problematic at best.
As far as I can tell, the last claim he made against me was that performance saturation was nowhere close. Now that reality has arrived, should I enjoy returning the favor because "facts don't care about your feelings"?

Back to the topic: LC0 is doing okay, but it seems unlikely to gain another +400 real-world Elo on average hardware just by tossing more games into it. Zuck's team probably demonstrated that with a reasonable amount of hardware in the NN realm.

What is happening? Can LCZero beat a human?
