Schizophrenic rating model for Leela

Uri Blass · Post by **Uri Blass** » Wed Jan 23, 2019 3:39 pm

Laskos wrote: ↑Wed Jan 23, 2019 1:00 pm
Michel wrote: ↑Tue Jan 22, 2019 1:22 pm Nice! Lots of ideas!

So a fully schizophrenic Leela is just two players. The score of a regular engine against a fully schizophrenic Leela with given elo1,elo2 would be the same as its score against two regular players with elo1 and elo2 respectively.

A partially schizophrenic Leela is more complicated. It is like two players but one of them gets to play more than the other.
"A" is important in schizophrenia, but ELO1 - ELO2 is even more important in defining the degree of illness.

Maybe on the lines like "positional, tactical" (as test-suites show) one can improvise a plausible explanation of this schizophrenia.

On these lines, one can imagine two sorts of distributions of outcomes in many games:

An accumulation of many very small errors/advantages (limited variance), mostly positional advantages in the case of Leela, which leads to Gaussian statistics through the central limit theorem, well mimicked in CDF by a logistic. Very high ELO1 for Leela in a pool of regular engines might derive from this.

Some errors may have a Cauchy or Levy-like distribution (undefined or infinite variance), and we have the "Levy flight", where the total distance traveled (to the outcome of a game) is almost always dominated by the largest single error, or at most two. Here Leela "excels" in the frequency of this sort of error compared to strong regular engines, hence its low ELO2.

Both lead, over many games, to a CDF well approximated by a logistic, but by two different logistics. Regular engines are close enough in properties to assimilate both into one single Elo logistic in ratings among them, so that an individual regular engine in a pool of other regular engines might be schizophrenic with an ELO1 - ELO2 of, say, a small 50, 100 or even 200 Elo points. Let's call this very mild schizophrenia "moody"; a regular engine can be at most moody in a pool of regular engines. Its rating will be well described by A*ELO1 + (1-A)*ELO2, and with this rating it will obey the general Elo model. But Leela is so different that ELO1 - ELO2 is above 1000 Elo points (the fit gave 1070), and it cannot fit a logistic Elo model in a pool of regular engines by this simple weighted averaging; it is truly pathologically schizophrenic. I will plot the cases of a moody regular engine (at most 200 points between ELO1 and ELO2) and a schizophrenic Leela-like engine (1000 points between ELO1 and ELO2), at full "split-personality" (A=0.5):

Moody regular engine (ELO1 = 100, ELO2 = -100, A=0.5):

The blue line is the true sum of the two logistics separated by 200 Elo points. The brown line is a logistic with average (ELO1 + ELO2)/2 = 0. This moody regular engine still fits very well with a logistic given by average rating.

Schizophrenic Leela-like engine (ELO1 = 500, ELO2 = -500, A=0.5):

The blue line is the true sum of the two logistics separated by 1000 Elo points. The brown line is a logistic with average (ELO1 + ELO2)/2 = 0. Now we see that a separation of 1000+ Elo points makes a huge difference: the average logistic fit fails badly, and the true sum indeed gives compressed ratings, explainable only by ELO1, ELO2 and A separately, not by averaging.
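Since the plots themselves are not reproduced here, the comparison can be sketched numerically. This is a minimal sketch, assuming the standard Elo logistic with a 400-point scale and taking the "true sum" to be the A-weighted sum of the two logistics; the function names and the scan range are illustrative choices, not taken from the post.

```python
def logistic(diff):
    """Expected score for a player rated `diff` Elo points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def mixture_score(opp, elo1, elo2, a=0.5):
    """Expected score of a regular opponent rated `opp` against an engine
    that plays as ELO1 with weight A and as ELO2 with weight 1 - A."""
    return a * logistic(opp - elo1) + (1.0 - a) * logistic(opp - elo2)


def average_score(opp, elo1, elo2, a=0.5):
    """Expected score of the same opponent against a single logistic
    placed at the weighted average rating A*ELO1 + (1-A)*ELO2."""
    return logistic(opp - (a * elo1 + (1.0 - a) * elo2))


def max_gap(elo1, elo2, a=0.5, lo=-1500, hi=1500, step=10):
    """Largest absolute score difference between the two curves,
    scanned over opponent ratings from `lo` to `hi` (assumed range)."""
    return max(
        abs(mixture_score(x, elo1, elo2, a) - average_score(x, elo1, elo2, a))
        for x in range(lo, hi + 1, step)
    )


moody_gap = max_gap(100, -100)   # moody regular engine
leela_gap = max_gap(500, -500)   # schizophrenic Leela-like engine
print(f"moody: {moody_gap:.3f}  Leela-like: {leela_gap:.3f}")
```

Under these assumptions the largest score gap stays below about 0.02 for the moody case and exceeds 0.2 for the Leela-like case, which matches the qualitative picture described for the two plots.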

Aside from test-suites, where we see highly pathological positional/tactical behavior of Leela, this sort of explanation is also supported by comparing evals. We can also see that regular engines probably never diverge by 200+ Elo points in their moodiness, since as a rule of thumb regular engines get stronger positionally and tactically fairly hand in hand.

From an easy experiment, we can probably derive that, compared to the simplest eval (material + PST), even a top SF10 eval cannot get a 1000 Elo point difference against a regular engine:

depth=1
Score of SF10 vs Predateur 2.1: 980 - 4 - 16 [0.988] 1000
Elo difference: 766.23 +/- 86.73
Finished match

In Predateur 2.1 even the PSTs are somewhat dubious.
But the search of Predateur is similarly weaker, by about the same Elo value. So, they will not look pathological in rating lists one to another, at most moody.
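For reference, the Elo difference reported in the match output above follows from the win/loss/draw counts through the usual logistic formula. A minimal sketch (the function name is illustrative, and the error margin is not computed here):

```python
import math


def elo_from_wld(wins, losses, draws):
    """Return (score fraction, Elo difference) from a match result,
    using the standard logistic Elo formula with a 400-point scale."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return score, -400.0 * math.log10(1.0 / score - 1.0)


# SF10 vs Predateur 2.1 at depth=1: 980 - 4 - 16
score, elo = elo_from_wld(980, 4, 16)
print(f"score={score:.3f} elo={elo:.2f}")
```

The same function can be applied to the other match results quoted in this thread.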

OTOH, Leela is some 500 Elo points stronger than SF10 in eval when comparing SF10 at depth=1 (about 20-30 nodes searched) with Leela at nodes=20, and still 350 Elo points stronger at nodes=1 than SF10 at depth=1. Tactically, though, Lc0 is like a weak regular engine, so a full-blown pathology of 1000+ Elo points in ELO1 - ELO2 can be explained.

It would be interesting to have a basic eval with SF search, or vice versa, SF eval with basic search, to see if they exhibit rating pathology. Almost surely not to the degree Leela exhibits.

What about SF with a big time handicap (let's say, slowed down by a factor of 100)?

Are we going to find behaviour similar to Leela's?

Laskos · Post by **Laskos** » Wed Jan 23, 2019 6:17 pm

I don't think so. You probably mean that this Stockfish, handicapped 100x in speed, will be stronger positionally than tactically compared to other regular engines of similar strength. An effect can exist, but it seems small. A mildly moody Stockfish (in a pool of regular engines, with SF's ELO1 and ELO2 differing by, say, 50 Elo points), with a completely insignificant deviation from the averaging logistic at (ELO1+ELO2)/2, will by your procedure probably still have ELO1 and ELO2 within 50 Elo points of each other. This is mostly because the positional strength of Stockfish (not only its tactics) also relies heavily on search. One can check: 100x slowing is 6-7 doublings, or maybe some 800 Elo points of handicap. So, this SF10 will perform at roughly the strength level of Ethereal 8.61 from this list:
http://fastgm.de/60-0.60.html
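The "6-7 doublings" figure is just the base-2 logarithm of the slowdown factor. A small sketch of that arithmetic, where the 800 Elo handicap is the rough guess from the post and the implied Elo per doubling is only what that guess works out to, not a measured value:

```python
import math


def speed_doublings(slowdown):
    """Number of time/speed doublings corresponding to a slowdown factor."""
    return math.log2(slowdown)


doublings = speed_doublings(100)       # 100x slower
assumed_handicap = 800                 # rough guess quoted in the post
elo_per_doubling = assumed_handicap / doublings
print(f"{doublings:.2f} doublings, about {elo_per_doubling:.0f} Elo per doubling")
```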

On the positional Openings200 suite, repeated 5 times, the results are:

SF10: 0.01s to 0.02s:
score=354/1000 [averages on correct positions: depth=7.6 time=0.01 nodes=47501]

Ethereal 8.61: 1.0s to 2.0s:
score=365/1000 [averages on correct positions: depth=9.1 time=0.40 nodes=1055015]

So, SF10 is not becoming a positional monster compared to its overall strength at ultra-ultra-fast time limits. It weakens significantly positionally too in this ultra-fast limit. With regular SF and similar regular engines, it is difficult to knock them out of the normalcy of regular engines and make them "little Leela schizophrenics" in a pool of regular engines.

Uri Blass · Post by **Uri Blass** » Wed Jan 23, 2019 7:01 pm

I think that not 100% of Stockfish's advantage over weak engines comes from better search, so I expect a 100x handicapped Stockfish to be relatively better positionally.

Maybe the estimate of 800 Elo is simply wrong and Stockfish at 0.01 seconds per move is clearly worse than Ethereal 8.61 at 1.00 second per move. Maybe the problem is that Stockfish makes bad positional mistakes at fast time controls because of bad pruning, which is unrelated to evaluation.

Today's Stockfish still does the stupid pruning that can cause mistakes at small depths, and here is an example:

Stockfish has a score of more than +8 for 24.Bf3 at small depth simply because it prunes for some reason safe moves by the queen.

5rk1/pB2pp1p/3n2pb/7q/2PP4/6Q1/PB3P1P/4R1K1 w - - 1 24

Stockfish_19012209_x64_modern:
1/1 00:00 250 479 +1.68 24.Bf3
2/2 00:00 441 846 +8.55 24.Bf3 Nxc4
3/3 00:00 635 1k +8.83 24.Bf3 Nxc4 25.Bxh5 Nxb2
4/4 00:00 809 2k +8.57 24.Bf3 Nxc4 25.Bxh5 Nxb2
5/5 00:00 1k 2k +3.56 24.Bf3 Nf5 25.Bxh5 Nxg3 26.hxg3
6/6 00:00 1k 3k +3.77 24.Bf3 Nf5 25.Bxh5 Nxg3 26.hxg3 gxh5
7/8 00:00 14k 26k -0.50 24.Ba3 Nf5 25.Qc7 Bd2 26.Rxe7 Nxd4 27.Kg2
8/9 00:00 16k 31k -1.04 24.Ba3 Nf5 25.Bf3 Nxg3 26.Bxh5 Nxh5 27.Rxe7 Nf4 28.Rxa7
9/13 00:00 25k 47k 0.00 24.Bd5 Nf5 25.Qf3 Qh4 26.Qg2
10/13 00:00 46k 84k -0.57 24.Bf3 Qa5 25.Qe5 Qxe5 26.dxe5 Nxc4 27.Bd4 Rd8 28.Bxa7 Bd2

Michel · Post by **Michel** » Thu Jan 24, 2019 1:41 am

Stockfish has a score of more than +8 for 24.Bf3 at small depth simply because it prunes for some reason safe moves by the queen.

This is because SF allows LMP (late move pruning) in PV nodes. This is well known to cause very visible blunders at depth <= 5, but since it was not an Elo loss at normal time controls, it went in.

Laskos · Post by **Laskos** » Fri Jan 25, 2019 7:05 pm

No, I just experimented with Stockfish at 0.02s/move (still some 35,000 nodes per move, or depths 9-14) against Fruit 2.1 at 2s/move:

Score of SF10 vs Fruit21: 48 - 35 - 17 [0.565] 100
Elo difference: 45.42 +/- 63.09
Finished match

SF10 in these conditions is a bit stronger than Fruit 2.1, fairly well matched.

On the positional Openings200 suite, run 5 times, SF10 at 0.01s to 0.02s per position had:
354/1000

Fruit 2.1 from 1.0s to 2.0s per position had:
350/1000

On tactical Arasan19 (200 positions)
SF10 0.01-0.02s:
15/200
Fruit 2.1 1.0-2.0s:
7/200

So, SF10 is not becoming a positional monster with weak tactics at 100x shorter TC, as you suggest. It becomes more like a Fruit 2.1 engine. That it misses in positional play due to bad pruning is no excuse. Lc0 with the new test30 nets, with just a 1-node search, scores on Openings200:
580/1000.
That is well above what SF10 on 4 cores solves at 2s/position, at some depths 19-23 and 13,000,000 nodes. With just one node of Lc0. Yes, Lc0 IS a positional monster without any search, never mind some "pruning flaws" in the deep search of SF.

Leela is a pathological engine, hardly mimicked by any regular engine in any conceivable situation. The Elo model will work under most (uniform) conditions for regular engines, but hardly under any condition for Leela in a pool of regular engines.

Uri Blass · Post by **Uri Blass** » Fri Jan 25, 2019 11:10 pm

Thanks for the information.

I still believe Stockfish should have a better evaluation function than Fruit, because Fruit is an old engine with free code, and since Fruit's time people have learned a lot of new things about evaluation.

It seems that you claim that the superior evaluation of Stockfish does not help it to be better positionally than Fruit in your conditions, because Fruit searches deeper in the relevant positional lines.

You may be right, but I doubt that your positional and tactical suites represent tactical and positional abilities in playing chess.
The only way to get convincing conclusions is to analyze the games between Stockfish 10 and Fruit 2.1 and count how many games a program won because of a clear tactical mistake by the opponent (based on the evaluations of both programs) and how many games it won without tactical mistakes.

A clear tactical mistake means that, after the mistake, the program that made it admits the mistake through a big change in its evaluation (for example, White lost the game and has an evaluation of 0.20 at move 23 and an evaluation of -1.34 at move 24).
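The criterion above could be applied mechanically to a game's evaluation trace. This is only a sketch: the 1.0-pawn threshold is a hypothetical choice (the post says only "big difference"), the input format is assumed, and the evaluations at moves 22 and 25 in the sample are made up for illustration around the 0.20 / -1.34 example from the post.

```python
def clear_tactical_mistakes(evals, threshold=1.0):
    """Find evaluation drops of at least `threshold` pawns between
    consecutive moves.

    `evals` is a list of (move_number, eval_in_pawns) pairs from the
    point of view of the side being checked (assumed input format).
    Returns (move where the drop shows up, eval before, eval after).
    """
    mistakes = []
    for (_, before), (move, after) in zip(evals, evals[1:]):
        if before - after >= threshold:
            mistakes.append((move, before, after))
    return mistakes


# White's own evaluations; moves 23 and 24 are the example from the post,
# moves 22 and 25 are invented filler values.
white_evals = [(22, 0.25), (23, 0.20), (24, -1.34), (25, -1.40)]
found = clear_tactical_mistakes(white_evals)
print(found)
```

Counting such drops per lost game, for both engines, would give the split between games decided by a clear tactical mistake and games decided without one.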

Laskos · Post by **Laskos** » Sat Jan 26, 2019 12:41 pm

Yes, you might be right, after all. On the tactical ECM suite, Fruit 2.1 scores significantly better than SF10 in these 1/100 conditions. This is more significant than the Arasan tactical test-suite result: Arasan is too hard, and the positions there are especially anti-engine (engine-analyzed for hardness).

I will just check whether SF10 at 0.01s/move behaves regularly against Ethereal 8.16 and Ethereal 9.30 at 1s/move (500 Elo points between these Ethereals in the FGRL (Ordo) list). It may happen that in these conditions SF10 also compresses the ratings.
I will give the results by the evening.

Laskos · Post by **Laskos** » Sat Jan 26, 2019 10:09 pm

Yes, I seem to get an effect of flattening in Elo differences using 1/100 Stockfish 10. SF 10 is playing at 0.01s/move, Ethereal at 1.0s/move.

First match:

Score of SF 10 vs Ethereal 8.16: 114 - 143 - 80 [0.457] 337
Elo difference: -29.97 +/- 32.53

Second match:

Score of SF 10 vs Ethereal 9.30: 15 - 426 - 59 [0.089] 500
Elo difference: -404.05 +/- 43.26

Difference: 374 +/- 53 Elo points
The difference in the FGRL rating list at 60''+0.6'' (small error margins) is 495 Elo points. So there is compression outside the error margins: much smaller than in the Leela case, but unusual for regular engines. If one assumes A close to 0.5 (close to equal contributions of ELO1 and ELO2), then ELO1 - ELO2 ~= 450 +/- 70 Elo points. SF 10 on 1 core at 0.01s/move is about 1150 Elo points weaker than at 1.0s/move. If one says "ELO1 is more a positional Elo" and "ELO2 is more a tactical Elo", then one can separate these 1150 Elo points coming from the 1/100 TC into two components: 800 Elo points of tactical weakening and 350 Elo points of positional weakening.
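The measured gap between the two Ethereals, as seen through the 1/100 Stockfish, can be recomputed from the two match results. A minimal sketch, assuming the two matches are independent and that their error margins combine in quadrature; the post does not say how its own margin was derived, so the margin computed this way may differ slightly from the quoted one.

```python
import math


def elo_gap(elo_a, err_a, elo_b, err_b):
    """Difference between two measured Elo differences, with the error
    margins combined in quadrature (independence is an assumption)."""
    return elo_a - elo_b, math.hypot(err_a, err_b)


# SF 10 vs Ethereal 8.16 and SF 10 vs Ethereal 9.30, from the matches above
gap, err = elo_gap(-29.97, 32.53, -404.05, 43.26)
list_gap = 495                      # FGRL rating list difference quoted above
compression = list_gap - gap
print(f"measured gap: {gap:.0f} +/- {err:.0f}, compression: {compression:.0f}")
```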

So, your improvised Stockfish 1/100 IS a bit schizophrenic in a pool of regular engines. Not to the degree Leela is (Leela flattens Elo differences against a pool of regular engines much more strongly), but it behaves pretty unusually for a regular engine.
