There is likely a problem with the training pipeline, i.e. something in the setup (always-promote networks, the cyclic learning rate schedule with its current values, the step counts used, etc.) is not working very well. There's a lot of discussion going on as to the likely reasons why, and we can expect to see procedural changes soon (which everyone hopes will work).

mirek wrote: Oh no, any explanation for why this is happening? Is it already a reason for concern, or is it still OK? And at which point, if the trend continues, will it become a reason for concern? Shouldn't the larger net just have skyrocketed in performance? Could it be that, if trained on the larger net all the way from the beginning, the progress would have been higher?

Laskos wrote: Here is a recent result from Carlos Canavessi up to ID138; it also shows a significant regression.
LCZero: Progress and Scaling. Relation to CCRL Elo
Moderators: hgm, Rebel, chrisw
- Posts: 143
- Joined: Wed Jan 17, 2018 1:26 pm
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
- Posts: 213
- Joined: Thu Dec 16, 2010 4:39 pm
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
Indeed, no progress anymore, whereas AlphaZero was progressing smoothly.
I'm aware Google is indirectly supporting this project by authorizing people to create several clients targeting a freely available TPU (in shared mode).
However, it might be more useful to have some DeepMind insider involved with the still-controversial AlphaZero chess results authorized to review and comment on the Leela code, to pinpoint possible implementation errors.
It has now been almost six months since that much-hyped AlphaZero preprint appeared on arXiv, and there is still no update.
Time for DeepMind/Google to open up this technology somewhat, to make it really credible, IMHO.
Per ardua ad astra
- Posts: 143
- Joined: Wed Jan 17, 2018 1:26 pm
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
We can't rely on DeepMind to solve our problems for us. The experience with them is that they don't give out any information beyond what they publish. The differences between AlphaZero and AlphaGo Zero in the training pipeline are actually substantial enough that a number of things from the AlphaGo Zero setup could be tried. I'm confident we'll find a solution to these roadblocks ourselves soon enough; Leela Zero (Go) is doing just fine, for example, and following in its footsteps is still a viable path if the current setup has problems.

melajara wrote: Indeed, no progress anymore when AlphaZero was progressing smoothly.
- Posts: 1142
- Joined: Thu Dec 28, 2017 4:06 pm
- Location: Argentina
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
The gauntlet is currently at 54/127, way lower than with the 12x networks I tested before. We'll see how Leela evolves from here.
Follow my tournament and some Leela gauntlets live at http://twitch.tv/ccls
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
In Dutch we say "meten is weten" (to measure is to know).
Before you do anything, you should obtain reliable Elo measurements...
For classical chess engines, self-play Elo correlates very well with Elo against foreign opponents. From the test results posted here, it is not clear that this is also true for LC0.
I would propose that the matches be played against some version of SF and that the results be considered with proper error bars. Since the matches are not used for validation, less frequent matches could be organized at higher resolution.
Several TCs should be tried to get an idea of the scaling behaviour of LC0 (it is also not clear that this is the same as for classical engines).
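As a sketch of what "results with proper error bars" could look like: the snippet below is my own illustration (not any particular tool's exact method), converting a W/D/L match result into an Elo difference with an approximate 95% margin, assuming the usual logistic Elo curve and a normal approximation of the score.

```python
import math

def elo_with_error(wins, draws, losses):
    """Elo difference implied by a W/D/L result, with an
    approximate 95% margin (normal approximation)."""
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n                  # score fraction
    # Per-game variance of the score around its mean
    var = (wins * (1 - p) ** 2 + draws * (0.5 - p) ** 2
           + losses * p ** 2) / n
    sd = math.sqrt(var / n)                       # standard error of p

    def to_elo(q):
        q = min(max(q, 1e-9), 1 - 1e-9)           # clamp away from 0 and 1
        return -400 * math.log10(1 / q - 1)       # logistic Elo curve

    lo, hi = to_elo(p - 1.96 * sd), to_elo(p + 1.96 * sd)
    return to_elo(p), (hi - lo) / 2

elo, margin = elo_with_error(27, 14, 21)          # a 62-game example
```

On a 62-game sample like the ones posted later in the thread, the margin comes out near +/- 80 Elo, which is why several hundred games are needed before differences of a few tens of Elo mean anything.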
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
I don't know what they bootstrap and tune to get new nets there, but the latest ID140 is again among the weakest, significantly weaker than ID124: a 100+ Elo loss compared to ID124 (the second "bignet", which improved greatly over ID123, the first "bignet" based solely on "smallnet" weights). And that after more than a million games since ID124. My tests are now at short TC and on CPU, but they seem reasonably representative.

Michel wrote: In Dutch we say "meten is weten" (to measure is to know).
Some questions were raised about the openings: does LC0 really play the openings better than later parts of the game? In the past, with v0.4, I was able to test EPD suites, and the "smallnet" showed excellent results on a positional opening suite, close to top engines, while showing deplorable results on tactical test-suites, worse than very weak engines. With v0.6 and the "bignets" I am unable to test EPD suites in Polyglot or GUIs; it seems the PV is not output in standard form (at fixed time, at least). But to the question of whether LC0 plays better in the openings (general play, i.e. positional + tactical), I can answer:
I took the ID124 "bignet" with the v0.6 client, one of the best nets, and performed 2 tests: one from a 3-mover balanced opening suite and one from an 8-mover balanced opening suite (a 5-move difference). I pitted ID124 at short TC (1s/move) against Jabba 1.0, a stable standard engine of similar strength (somewhat stronger at this TC).
3-mover result:
Code:
Games Completed = 500 of 500 (Avg game length = 103.231 sec)
Settings = Gauntlet/64MB/1000ms per move/M 5500cp for 30 moves, D 150 moves/EPD:C:\LittleBlitzer\3moves_GM_04.epd(817)
Time = 14275 sec elapsed, 0 sec remaining
1. LCZero CPU ID124 172.5/500 116-271-113 (L: m=271 t=0 i=0 a=0) (D: r=82 i=13 f=13 s=2 a=3) (tpm=948.5 d=12.49 nps=230)
2. Jabba 1.0 327.5/500 271-116-113 (L: m=116 t=0 i=0 a=0) (D: r=82 i=13 f=13 s=2 a=3) (tpm=802.5 d=9.11 nps=0)
8-mover result:
Code:
Games Completed = 500 of 500 (Avg game length = 90.792 sec)
Settings = Gauntlet/64MB/1000ms per move/M 9000cp for 30 moves, D 150 moves/EPD:C:\LittleBlitzer\8moves_v7.epd(21067)
Time = 12713 sec elapsed, 0 sec remaining
1. LCZero CPU ID124 142.5/500 89-304-107 (L: m=304 t=0 i=0 a=0) (D: r=88 i=11 f=5 s=2 a=1) (tpm=949.2 d=12.49 nps=180)
2. Jabba 1.0 357.5/500 304-89-107 (L: m=89 t=0 i=0 a=0) (D: r=88 i=11 f=5 s=2 a=1) (tpm=803.0 d=9.10 nps=0)
So, although Jabba 1.0 is generally stronger at this TC, LC0, when left on its own against Jabba for just 5 more opening moves, gains a whopping 50 Elo points (with about 20 Elo points standard deviation on the difference). It is remarkable, as those 5 opening moves are not all positional; there are some tactics involved too, at which LC0 is notoriously weak. So, yes, LC0 performs (significantly) better in the openings than in later parts of the game.
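The 50-Elo gap and its roughly 20-Elo standard deviation can be reproduced from the two gauntlet tables. This is a sketch under my own assumptions (logistic Elo model, delta-method error propagation), not necessarily how the numbers above were computed:

```python
import math

def perf_elo(wins, draws, losses):
    """Performance Elo vs the opponent and its standard deviation
    (delta method on the logistic Elo curve)."""
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n
    var = (wins * (1 - p) ** 2 + draws * (0.5 - p) ** 2
           + losses * p ** 2) / n
    sd_p = math.sqrt(var / n)                           # SE of the score
    elo = -400 * math.log10(1 / p - 1)
    sd_elo = sd_p * 400 / (math.log(10) * p * (1 - p))  # curve slope * SE
    return elo, sd_elo

# LC0 ID124 vs Jabba 1.0, W/D/L taken from the two gauntlets above
e3, s3 = perf_elo(116, 113, 271)   # 3-mover suite: 172.5/500
e8, s8 = perf_elo(89, 107, 304)    # 8-mover suite: 142.5/500
gap = e3 - e8                      # roughly 48 Elo
sd_gap = math.hypot(s3, s8)        # roughly 20 Elo
```

The independent-errors combination (`math.hypot`) is what gives the "about 20 Elo" standard deviation of the difference.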
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
Again, no real progress from ID160 to ID173, over 700,000 games. The official graph also shows the stall, although it is often not representative of real progress.
The scaling in time (and hardware) of LC0 is much better than that of A/B engines. So, if one talks about LC0's CCRL Elo, one should specify the time control and hardware. For example, on 1 CPU core at 1s/move it is about 2000 CCRL Elo points, but on a GPU like an Nvidia 1080 Ti at LTC (say, 2 minutes/move) it might be 2750 CCRL Elo points. I tried to see how this better scaling shows up in two cases, using two very different test-suites:
Scaling:
1/ Tactical middlegame ECM200:
==========
LC0 ID160:
2s:
score=56/200 [averages on correct positions: depth=10.6 time=0.46 nodes=123]
20s:
score=75/200 [averages on correct positions: depth=13.2 time=3.15 nodes=1107]
+19
==========
GreKo 6.5 (2330 CCRL)
2s:
score=110/200 [averages on correct positions: depth=5.8 time=0.19 nodes=454689]
20s:
score=143/200 [averages on correct positions: depth=7.3 time=1.91 nodes=4718200]
+33
==========
Predateur 2.2.1 (1786 CCRL)
2s:
score=88/200 [averages on correct positions: depth=7.0 time=0.30 nodes=923547]
20s:
score=107/200 [averages on correct positions: depth=8.0 time=2.15 nodes=6568689]
+19
==========
LC0 doesn't seem to scale better than standard A/B engines on a tactical middlegame test-suite.
2/ Positional opening suite (200 positions)
==========
LC0 ID160
2s:
score=92/200 [averages on correct positions: depth=8.3 time=0.16 nodes=41]
20s:
score=117/200 [averages on correct positions: depth=11.0 time=2.18 nodes=694]
+25
==========
GreKo 6.5 (2330 CCRL)
2s:
score=72/200 [averages on correct positions: depth=4.8 time=0.17 nodes=325872]
20s:
score=78/200 [averages on correct positions: depth=6.9 time=1.64 nodes=3262982]
+6
==========
Andscacs 0.93 (3308 CCRL)
2s:
score=113/200 [averages on correct positions: depth=9.9 time=0.23 nodes=924945]
20s:
score=126/200 [averages on correct positions: depth=13.0 time=2.26 nodes=8718487]
+13
==========
LC0 seems to scale significantly better than standard A/B engines on a positional opening test-suite.
=================================================================
The result somewhat puzzles me. It seems that letting LC0 analyze (search) for a longer time, or on stronger hardware, improves its positional understanding more than its tactical ability. Can somebody explain what happens, or are these results meaningless? It would seem that monstrous hardware wouldn't help very much with LC0's serious tactical deficiency; it would help, but it seems to help the positional understanding even more. If A0 really was so strong tactically, then positionally it might have been a monster, if my results here mean something. Well, on the other hand, the newer nets might improve dramatically on tactics (it doesn't seem to be the case so far, but who knows), so this is speculation.
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
Here it seems to be suggested that id170
https://docs.google.com/spreadsheets/d/ ... edit#gid=0
is at a similar level to Fruit 2.1 at 1min+1s.
The error bars are quite big, but it seems there was real progress from id160 to id170.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
Hmm, interesting. He has a good GPU (though probably still not a 1080 Ti), but with these nps and this time control (1'+1'') I expected a performance of about 2450 CCRL Elo. Isn't his 2685 CCRL Elo too high? I think that in TCEC conditions, i.e. at TCEC time control on a 1080 Ti driven by 4 cores, if that performance were real it would come close to 3000 CCRL Elo points. But in TCEC itself, an old but still strong LC0 performs 300-400 Elo points weaker than engines at the 3000-3050 CCRL Elo level.

Michel wrote: Here it seems to suggest that id170
https://docs.google.com/spreadsheets/d/ ... edit#gid=0
is of a similar level as Fruit 2.1 at 1min+1s.
The error bars are quite big but it seems there was real progress from id160 to id170.
His error margins (2SD?) are some 50 Elo points; mine would be about 30 if I combine my results for 160 and 163 and compare them with 170 and 173 combined. All in all, there might be progress, but I believe not by much.
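For intuition on how such margins scale with game count: for a roughly even match, the 2-SD Elo margin shrinks like 1/sqrt(games). A quick sketch (the draw ratio is a parameter I'm assuming, not taken from any specific test):

```python
import math

def even_match_margin(games, draw_ratio=0.4):
    """Approximate 2-SD Elo margin for an evenly matched pair,
    given a total game count and an assumed draw ratio."""
    # In an even match, wins and losses each occur with probability
    # (1 - d)/2 and deviate from the 0.5 mean score by +/- 0.5,
    # so the per-game score variance is (1 - d) * 0.25.
    var = (1 - draw_ratio) * 0.25
    sd_p = math.sqrt(var / games)
    slope = 400 / (math.log(10) * 0.25)   # Elo per unit score near 50%
    return 2 * sd_p * slope               # quadrupling games halves this
```

With a ~60-game sample this gives a margin of about +/- 78 Elo; getting it down to +/- 30 needs roughly seven times as many games, which is why combining several test runs tightens the estimate so much.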
- Posts: 1627
- Joined: Thu Mar 09, 2006 12:35 pm
Re: LCZero: Progress and Scaling. Relation to CCRL Elo
For ID 170, watching its games, it is a big step forward!

Laskos wrote: Hmm, interesting. He has a good GPU (still probably not 1080 Ti), but with these nps and time control (1'+ 1'') I expected a 2450 CCRL Elo performance or so. Isn't his 2685 CCRL Elo too much?
Look for example here also.
2700 CCRL Elo seems very likely on a good GPU.
https://docs.google.com/spreadsheets/d/ ... edit#gid=0
Code:
v7 slowmover125 id157 laser 1.0 2728 ccrl 40/1 Score of lczero v7 id157 slowmover 125 vs Laser-1_0: 11 - 19 - 13 [0.407]
Elo difference: -65.40 +/- 89.61
v7 slowmover125 id160 laser 1.0 40/1 Score of lczero v7 id160 slowmover 125 vs Laser-1_0: 5 - 10 - 6 [0.381]
Elo difference: -84.34 +/- 135.12
v7 slowmover125 id162 laser 1.0 40/1 Score of lczero v7 id162 slowmover 125 vs Laser-1_0: 32 - 40 - 29 [0.460]
Elo difference: -27.58 +/- 57.77
v7 slowmover120 id164 laser 1.0 40/1 Score of lczero v7 id164 slowmover 120 vs Laser-1_0: 26 - 25 - 8 [0.508]
Elo difference: 5.89 +/- 83.91
v7 slowmover120 id170 laser 1.0 40/1 Score of lczero v7 id170 slowmover 120 vs Laser-1_0: 27 - 21 - 14 [0.548]
Elo difference: 33.73 +/- 77.54
v7 slowmover120 id171 laser 1.0 4cpu 2800 ccrl 40/1 Score of lczero v7 id171 slowmover 120 vs Laser-1_0 4cpu: 19 - 30 - 10 [0.407]
Elo difference: -65.54 +/- 83.56
After his son's birth they asked him:
"Is it a boy or a girl?"
"YES!" he replied.....
"Is it a boy or girl?"
YES! He replied.....