Schizophrenic rating model for Leela

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Schizophrenic rating model for Leela

Post by Laskos »

If we assume that a regular engine sees Leela as a schizophrenic, double-personality engine, then the scores regular engines get against Leela can be explained by the usual Elo logistic, applied to each personality separately. An "Elo" rating can still be defined for Leela (we will call it Elo_of_Leela), although Leela in a pool of regular engines doesn't obey the Elo logistic.

Let's define this schizophrenic Leela engine by the scores regular engines get against it as:

[Image: equation (1)]

Here SCORE is the score a regular engine gets in a match against Leela, ranging from 0 (0%) to 1 (100%).
A is the degree of schizophrenia of Leela: closer to 0.5 means a more accentuated double personality, closer to 0 or 1 means less schizophrenia (the range is from 0 to 1).
ELO is the Elo of the regular engine.
ELO1 and ELO2 are the Elo ratings defining the two personalities of Leela.
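
A form of equation (1) that fits these definitions is a weighted sum of two Elo logistics, one per personality. This is a reconstruction, an assumption rather than the formula from the post, but it does reproduce the fitted numbers quoted further down (A = 0.53, ELO1 = 3500, ELO2 = 2430 give SCORE of about 0.5 at ELO = 3071):

```latex
\mathrm{SCORE}(\mathrm{ELO}) =
  \frac{A}{1 + 10^{(\mathrm{ELO1} - \mathrm{ELO})/400}}
  + \frac{1 - A}{1 + 10^{(\mathrm{ELO2} - \mathrm{ELO})/400}}
\tag{1}
```

Under this reading A is the weight of the ELO1 personality, and A = 0 or A = 1 collapses the model to a single ordinary logistic.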

We define the Elo_of_Leela in a pool of regular engines as the Elo of a regular engine against which Leela scores exactly 1/2 (50%).
Setting SCORE=0.5 and solving for ELO, we get Elo_of_Leela as a function of ELO1, ELO2 and A:

[Image: equation (2)]
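
If equation (1) is taken to be the weighted sum of two logistics, A/(1 + 10^((ELO1 - ELO)/400)) + (1 - A)/(1 + 10^((ELO2 - ELO)/400)), which is an assumption, then setting SCORE = 1/2 gives a quadratic in y = 10^(ELO/400). The shorthand a, b, y below is introduced only for this sketch:

```latex
% Shorthand (hypothetical): a = 10^{ELO1/400}, b = 10^{ELO2/400}, y = 10^{ELO/400}
\frac{A\,y}{y + a} + \frac{(1 - A)\,y}{y + b} = \frac{1}{2}
\;\Longrightarrow\;
y^{2} - (2A - 1)(a - b)\,y - a\,b = 0

% Taking the positive root:
\mathrm{Elo\_of\_Leela} = 400 \log_{10}\!\left[
  \left(A - \tfrac{1}{2}\right)(a - b)
  + \sqrt{\left(A - \tfrac{1}{2}\right)^{2}(a - b)^{2} + a\,b}
\right]
```

For A = 1/2 this reduces to (ELO1 + ELO2)/2, the midpoint of the two personalities. Only the difference ELO1 - ELO2 and A matter up to a common shift, in line with the translation invariance of the model.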

Given the score a regular engine gets against Leela, the Elo of that regular engine in a pool of regular engines cannot be read off immediately. Against regular engines, Elo as a function of score is given by a simple logistic inversion; here we instead have to solve equation (1) for ELO as a function of SCORE, ELO1, ELO2 and A. The solution for the Elo of the regular engine is:

[Image: equation (3)]
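
A minimal Python sketch of this inversion, assuming equation (1) is a weighted sum of two Elo logistics with A as the weight of the ELO1 personality (a reconstruction, not the formula from the post). The function names `opponent_score` and `elo_from_score` are hypothetical; the inversion solves the quadratic in 10^(ELO/400) that the assumed form leads to.

```python
import math


def opponent_score(elo, a, elo1, elo2):
    """Assumed equation (1): score of a regular engine rated `elo` against Leela."""
    strong = 1.0 / (1.0 + 10.0 ** ((elo1 - elo) / 400.0))
    weak = 1.0 / (1.0 + 10.0 ** ((elo2 - elo) / 400.0))
    return a * strong + (1.0 - a) * weak


def elo_from_score(score, a, elo1, elo2):
    """Invert the assumed equation (1): Elo of the regular engine from its score."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must lie strictly between 0 and 1")
    # Work in units of 10**(elo1/400) so the powers of ten stay small.
    p = 1.0
    q = 10.0 ** ((elo2 - elo1) / 400.0)
    # Quadratic (1 - s) y^2 + lin * y - s p q = 0 in y = 10**((elo - elo1)/400).
    lin = (1.0 - a - score) * p + (a - score) * q
    root = math.sqrt(lin * lin + 4.0 * score * (1.0 - score) * p * q)
    if lin > 0.0:
        y = 2.0 * score * p * q / (lin + root)  # same root, avoids cancellation
    else:
        y = (root - lin) / (2.0 * (1.0 - score))
    return elo1 + 400.0 * math.log10(y)


# Values quoted in the post: A = 0.53, ELO1 = 3500, ELO2 = 2430.
elo_of_leela = elo_from_score(0.5, 0.53, 3500.0, 2430.0)
print(round(elo_of_leela))
```

With SCORE = 0.5 the same function gives Elo_of_Leela, and with the fitted values from the post it lands close to the quoted 3071.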

Now one can check the model by fitting the parameters A, ELO1, ELO2 to empirical data. The model is invariant under Elo translations: only Elo differences count, so basically we have only 2 variables in the model (A and ELO1 - ELO2).
The best empirical data (a rating list of regular engines) at short time control over a large Elo span that I found are here:
http://fastgm.de/60-0.60.html
The ratings are calculated by Ordo, so they do not suffer from the compression or distortion of BayesElo. Also, the error margins are small. The time control is 60'' + 0.6''.

I used 7 datapoints from this list, from the weakest, Ethereal 8.16, to the strongest, Stockfish 10. Against each of these 7 regular engines, Leela (one of the latest test30 nets) played 1000 games.
Warning: the time control used in these games was very short, 6'' + 0.1''.

The fit of the model to the 7 datapoints, over a very large Elo span, gave (from equation (3)):

A = 0.53 (close to 0.5, very schizophrenic Leela)
ELO1 - ELO2 = 1070
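
A two-parameter fit of this kind can be sketched as a coarse grid search. Several things here are assumptions: equation (1) is taken to be a weighted sum of two Elo logistics, ELO1 is pinned at 3500 (allowed by translation invariance, provided the opponents' ratings are on the same scale), and the residual is measured in score rather than in Elo as a fit through equation (3) would do. The measured scores are not listed in the post, so the datapoints below are synthetic, generated from the model itself; `fit_a_and_gap` is a hypothetical name.

```python
def opponent_score(elo, a, elo1, elo2):
    """Assumed equation (1): weighted sum of two Elo logistics."""
    strong = 1.0 / (1.0 + 10.0 ** ((elo1 - elo) / 400.0))
    weak = 1.0 / (1.0 + 10.0 ** ((elo2 - elo) / 400.0))
    return a * strong + (1.0 - a) * weak


def fit_a_and_gap(points, elo1=3500.0):
    """Grid search for A and the gap ELO1 - ELO2, with ELO1 pinned by translation."""
    best = None
    for a_pct in range(0, 101):          # A = 0.00, 0.01, ..., 1.00
        a = a_pct / 100
        for gap in range(0, 2001, 10):   # ELO1 - ELO2 = 0, 10, ..., 2000
            sse = sum((opponent_score(elo, a, elo1, elo1 - gap) - score) ** 2
                      for elo, score in points)
            if best is None or sse < best[0]:
                best = (sse, a, gap)
    return best[1], best[2]


# SYNTHETIC datapoints generated from the model itself (not the measured scores).
synthetic = [(elo, opponent_score(elo, 0.53, 3500.0, 2430.0))
             for elo in range(2900, 3501, 100)]
a_fit, gap_fit = fit_a_and_gap(synthetic)
print(a_fit, gap_fit)
```

On noise-free synthetic points the search simply recovers the parameters that generated them; with real match scores a proper least-squares routine and error bars would be the more sensible choice.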


Basically, Leela has two personalities of similar importance in matches, differing by about 1000 Elo points.
If I choose ELO1 equal to 3500, then ELO2 is 2430.

From equation (2), the Elo_of_Leela is 3071 Elo points. It can be shifted to any value by translating ELO1 and ELO2 while keeping their difference constant. I will keep those values, translate the rating list of Andreas (fastgm) instead, and see how the fit works.


[Image: plot of the model fit through the 7 black datapoints]

Each black datapoint comes from a 1000-game ultra-fast match of one regular engine against Leela. The fit is almost perfect, with just 2 parameters fitting 7 datapoints. So a double personality of Leela, as seen by regular engines, is in almost perfect agreement with the scores one gets when playing differently rated regular engines against her. Again, Leela could be given an "Elo_of_Leela", but Leela doesn't obey the Elo logistic model of regular engines. So, in rating lists, Leela's rating may be almost arbitrary, depending on the opponents. If you give her weak opponents, she will be rated lower; if you give her strong opponents, she will be rated higher. The "Elo compression" when a regular engine plays Leela is very pronounced, especially on small to medium Elo spans.
The two personalities, on a fairly strong GPU and at a reasonable time control, can be described as follows: one is super-strong, well above any regular engine; the other is at the level of a mediocre regular engine. I do not know whether the double personality is expressed mostly across matches of games or within each game, move by move.
The only warning is that the TC I used was very short.
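
The opponent dependence and the compression can be illustrated numerically. The sketch below assumes equation (1) is a weighted sum of two Elo logistics (a reconstruction) and uses the fitted values quoted above; the opponent ratings 2800, 3100 and 3400 and the helper names are hypothetical. It asks what a plain single-logistic rating calculation would conclude from each match.

```python
import math


def opponent_score(elo, a, elo1, elo2):
    """Assumed equation (1): weighted sum of two Elo logistics."""
    strong = 1.0 / (1.0 + 10.0 ** ((elo1 - elo) / 400.0))
    weak = 1.0 / (1.0 + 10.0 ** ((elo2 - elo) / 400.0))
    return a * strong + (1.0 - a) * weak


def naive_leela_rating(opp_elo, opp_score):
    """Rating a single-logistic Elo model assigns Leela from one match result."""
    return opp_elo + 400.0 * math.log10((1.0 - opp_score) / opp_score)


def implied_diff(opp_score):
    """Opponent's rating minus Leela's, as implied by a single logistic."""
    return 400.0 * math.log10(opp_score / (1.0 - opp_score))


A, ELO1, ELO2 = 0.53, 3500.0, 2430.0  # fitted values quoted in the post

for opp_elo in (2800.0, 3100.0, 3400.0):  # hypothetical regular engines
    s = opponent_score(opp_elo, A, ELO1, ELO2)
    print(opp_elo, round(s, 3), round(naive_leela_rating(opp_elo, s)))

# Two regular engines 600 Elo apart, measured only through their Leela matches.
gap_seen_through_leela = (implied_diff(opponent_score(3400.0, A, ELO1, ELO2))
                          - implied_diff(opponent_score(2800.0, A, ELO1, ELO2)))
print(round(gap_seen_through_leela))
```

Under these assumptions Leela's naive rating rises with the strength of the opponent, and two regular engines that are 600 Elo apart look far closer together when judged only by their scores against her.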

If humans resemble Leela in their play, as many argue, including me, then humans too seem schizophrenic to regular engines. The Elo ratings of humans in a pool of regular engines will be compressed, and the best a human can do to improve his rating is to play the best engine. If, say, the top 5 engines with their CCRL ratings were introduced into the human FIDE pool, they would inflate FIDE human ratings in general, and the top GMs would do better to play only engines to improve their FIDE rating. A similar plot could probably be made for a human playing in a pool of regular engines, but no human will play thousands of games against strong engines under FIDE conditions to provide enough empirical data.
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Schizophrenic rating model for Leela

Post by smatovic »

Laskos wrote: Mon Jan 21, 2019 11:27 am
...
So, in rating lists, Leela's rating may be almost arbitrary, depending on the opponents. If you give her weak opponents, she will be rated lower; if you give her strong opponents, she will be rated higher. The "Elo compression" when a regular engine plays Leela is very pronounced, especially on small to medium Elo spans.
The two personalities, on a fairly strong GPU and at a reasonable time control, can be described as follows: one is super-strong, well above any regular engine; the other is at the level of a mediocre regular engine. I do not know whether the double personality is expressed mostly across matches of games or within each game, move by move.
The only warning is that the TC I used was very short.
...
Hmm, funny observation, I am no expert in AI psychology, but I can imagine that this behaviour is rooted in the zero-selfplay approach.

Playing millions of games in selfplay with zero prior knowledge must lead to artefacts of weaker game play.
These weaker games are not deleted from the neural network but "overwritten" by better ones.
I guess that Leela has a dozen mixed-up personalities, ranging from a 0 Elo player to a 3000+ Elo player,
so if you give her a bad opponent she falls back to the earlier, bad games she played and was trained on.

--
Srdja
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Schizophrenic rating model for Leela

Post by smatovic »

...it would be interesting to know whether Antifish, trained on Stockfish games only, or DeuX also show such behaviour.

--
Srdja
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Schizophrenic rating model for Leela

Post by Laskos »

smatovic wrote: Mon Jan 21, 2019 1:02 pm
Hmm, funny observation, I am no expert in AI psychology, but I can imagine that this behaviour is rooted in the zero-selfplay approach.

Playing millions of games in selfplay with zero prior knowledge must lead to artefacts of weaker game play.
These weaker games are not deleted from the neural network but "overwritten" by better ones.
I guess that Leela has a dozen mixed-up personalities, ranging from a 0 Elo player to a 3000+ Elo player,
so if you give her a bad opponent she falls back to the earlier, bad games she played and was trained on.

--
Srdja
Maybe "positional" versus "tactical", or something along these lines, better describes this "double personality" from the point of view of a regular engine? It can be seen in some test suites. The most positional one, Openings200, which deals with openings, repeated 5 times, gives the following results (1s to 2s to solution):

Lc0 v20.1 ID32700: 761/1000
Houdini 6.03:      558/1000
Komodo 12.3:       556/1000
Stockfish 10:      524/1000
Booot 6.3.1:       494/1000
Andscacs 0.95:     484/1000
Ethereal 11.00:    457/1000
Fire 7.1:          431/1000
Texel 1.07:        419/1000

An outlandish superiority of Leela on this very positional test suite.
However, on WAC200, a very tactical suite trimmed by Albert Silver, all of the regular engines above score between 195/200 and 199/200 (1s to 2s to solution), and since maybe 1-2 of the suite's solutions are wrong, they basically solve almost every position. Lc0 v20.1 ID32700 scores only 150/200: a hideous result, worthy of a very weak regular engine. Another dichotomy is Leela's much stronger play in regular openings compared to endgames.

There are possible causes of this "schizophrenia". But reverse the psychology: how is an engine of similar strength to a top GM, say Fruit 2.1 on one strong core, seen by that top GM? Probably as a monomaniacal tactical meat-grinder, which relies for its positional understanding mostly on deep search alone and has very weak concepts of positional play. I am not sure who is "sane": the "schizophrenic" Leela (and probably humans) or the "monomaniacal" regular engines.
Last edited by Laskos on Mon Jan 21, 2019 1:57 pm, edited 2 times in total.
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Schizophrenic rating model for Leela

Post by smatovic »

Laskos wrote: Mon Jan 21, 2019 1:47 pm
Maybe "positional" versus "tactical", or something along these lines, better describes this "double personality" from the point of view of a regular engine? It can be seen in some test suites. The most positional one, Openings200, which deals with openings, repeated 5 times, gives the following results (1s to 2s to solution):

Lc0 v20.1 ID32700: 761/1000
Houdini 6.03:      558/1000
Komodo 12.3:       556/1000
Stockfish 10:      524/1000
Booot 6.3.1:       494/1000
Andscacs 0.95:     484/1000
Ethereal 11.00:    457/1000
Fire 7.1:          431/1000
Texel 1.07:        419/1000

An outlandish superiority of Leela on this very positional test suite.
However, on WAC200, a very tactical suite trimmed by Albert Silver, all of the regular engines above score between 195/200 and 199/200 (1s to 2s to solution), and since maybe 1-2 of the suite's solutions are wrong, they basically solve almost every position. Lc0 v20.1 ID32700 scores only 150/200: a hideous result, worthy of a very weak regular engine. Another dichotomy is Leela's much stronger play in regular openings compared to endgames.

There are possible causes of this "schizophrenia". But reverse the psychology: how is an engine of similar strength to a top GM, say Fruit 2.1 on one strong core, seen by that top GM? Probably as a monomaniacal tactical meat-grinder, which relies for its positional understanding mostly on deep search alone and has very weak concepts of positional play. I am not sure who is "sane": the "schizophrenic" Leela (and probably humans) or the "monomaniacal" regular engines.
Ah,
so you are saying there is a positionally super-strong Leela and a tactically weak Leela, two personalities.

--
Srdja
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Schizophrenic rating model for Leela

Post by Laskos »

smatovic wrote: Mon Jan 21, 2019 1:55 pm
Ah,
so you are saying there is a positionally super-strong Leela and a tactically weak Leela, two personalities.

--
Srdja
A proposal. But I am not sure. Is it dominated by a move-by-move double personality, by phases of the game, game by game, or many games by many games? Probably it is closer to a move-by-move double personality, if these test suites on single positions are relevant for a rating scheme. I am not sure, but Leela clearly diverges from a simple Elo logistic when rated in a pool of regular engines, and a plausible model is "double personality", as the fit shows.
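
One of the readings listed here, game-by-game switching, is easy to simulate. The sketch below is only an illustration under stated assumptions: Leela plays each game as the ELO1 personality with probability A and as the ELO2 personality otherwise, every game is decided by a single Elo logistic, and draws are ignored. The function names are hypothetical, and the parameter values are the fitted ones from the first post. It says nothing about whether the move-by-move reading is the better one.

```python
import random


def logistic_score(elo, opp_elo):
    """Ordinary Elo logistic: expected score of `elo` against `opp_elo`."""
    return 1.0 / (1.0 + 10.0 ** ((opp_elo - elo) / 400.0))


def simulate_match(elo, a, elo1, elo2, games, seed=0):
    """Score of a regular engine when Leela picks a personality per game."""
    rng = random.Random(seed)
    points = 0
    for _ in range(games):
        # Leela shows up as personality 1 with probability a, else personality 2.
        leela_elo = elo1 if rng.random() < a else elo2
        # Draw-free simplification: the regular engine wins or loses outright.
        if rng.random() < logistic_score(elo, leela_elo):
            points += 1
    return points / games


A, ELO1, ELO2 = 0.53, 3500.0, 2430.0
elo = 3071.0  # a regular engine rated at the quoted Elo_of_Leela
simulated = simulate_match(elo, A, ELO1, ELO2, games=100_000)
expected = A * logistic_score(elo, ELO1) + (1.0 - A) * logistic_score(elo, ELO2)
print(round(simulated, 3), round(expected, 3))
```

The simulated score converges to the weighted sum of the two logistics, so a per-game mixture is at least one mechanism that produces a curve of that shape.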
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Schizophrenic rating model for Leela

Post by Laskos »

Laskos wrote: Mon Jan 21, 2019 2:16 pm
A proposal. But I am not sure. Is it dominated by a move-by-move double personality, by phases of the game, game by game, or many games by many games? Probably it is closer to a move-by-move double personality, if these test suites on single positions are relevant for a rating scheme. I am not sure, but Leela clearly diverges from a simple Elo logistic when rated in a pool of regular engines, and a plausible model is "double personality", as the fit shows.
It seems Leela has approximately equal parts of the two worlds: the accumulation of small advantages, and, say, 0-2 large errors in each game. Each of these effects gives rise to a different distribution of strength over many games, and each of these two distributions is close enough in CDF to a logistic that the rating scheme can be approximated by two different logistics.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Schizophrenic rating model for Leela

Post by jdart »

My impression is that when Leela wins, it is usually very interesting: it found something that alpha-beta searchers miss. When it loses, it is usually not interesting: it just goes down to a tactic, like a weaker alpha-beta engine.

--Jon
Branko Radovanovic
Posts: 89
Joined: Sat Sep 13, 2014 4:12 pm
Location: Zagreb, Croatia
Full name: Branko Radovanović

Re: Schizophrenic rating model for Leela

Post by Branko Radovanovic »

I was hoping someone would come up with an alternative, two-parameter Elo model able to describe Leela's performance...

But, aren't all engines schizophrenic to a degree? :-)

We already know that engine-to-engine games expose a weakness of the Elo model - namely, it assumes a fixed transitivity formula (A is x Elo above B, and B is y Elo above C, ergo A is x+y Elo above C), but it may not be correct in general, or it may not be correct across the Elo scale, or it may not be correct for all players (such as Leela). In particular, ratings in engine-to-engine Elo lists get compressed over time. Perhaps an alternative Elo curve - with a bit of schizophrenia thrown in - could be compression-free?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Schizophrenic rating model for Leela

Post by Laskos »

Branko Radovanovic wrote: Mon Jan 21, 2019 4:27 pm
I was hoping someone would come up with an alternative, two-parameter Elo model able to describe Leela's performance...
Well, rather than generalizing and overhauling the whole Elo system for all engines, I just came up with a "Leela rating model" for one engine. A two-parameter Elo model could surely incorporate Leela, but as the regular engines seem to obey the existing very simple Elo model + a certain draw model pretty well, maybe it's not yet time to discard it. And if one doesn't discard it, Leela seems significantly abnormal in a pool of regular engines.

And as Jon nicely put it, this "schizophrenia" can have the intuitive, common-sense explanation he described. I too am inclined toward his remarks as an explanation of the "illness". Also, if humans are more similar to Leela than to regular engines, only now can we see how abnormal the regular engines are to humans (or Leela) playing chess. They are sick monomaniacal searchers, excelling at tactics to such a degree that even positionally they are getting strong mostly by deep search.
But, aren't all engines schizophrenic to a degree? :-)

We already know that engine-to-engine games expose a weakness of the Elo model - namely, it assumes a fixed transitivity formula (A is x Elo above B, and B is y Elo above C, ergo A is x+y Elo above C), but it may not be correct in general, or it may not be correct across the Elo scale, or it may not be correct for all players (such as Leela). In particular, ratings in engine-to-engine Elo lists get compressed over time. Perhaps an alternative Elo curve - with a bit of schizophrenia thrown in - could be compression-free?
Transitivity is obeyed pretty well by regular engines; I even came to fairly conclusive results that on large Elo spans they follow the logistic and not the Gaussian distribution. Some 2-3 years ago this was unclear to me.
I am not sure what you mean by "ratings in engine-to-engine Elo lists get compressed over time". There are compressing lists, SSDF being one of them, but for ages they have tried to align it with human ratings, and they use a dubious rating calculator. Also, their conditions varied over time, yet they include all conditions in one list. CCRL uses BayesElo, which is compressing, especially on large Elo spans. Also, in my experiments the BayesElo draw model (Rao-Kupper) is ruled out, but the Davidson draw model (which Ordo uses) is not ruled out. I didn't see this compression in regular engine-engine ratings under identical conditions on rating lists using Ordo. Elo does get compressed with stronger hardware, time control and stronger engines, due to the higher draw rate, but it is not compressed "over time" under identical conditions. Maybe I missed something. In general, my impression is that under correct conditions, regular engines obey the (logistic) Elo model pretty or very well, even on large Elo spans. Leela is a "weird sick man" in this pool of regular engines (even on small Elo spans).