Wilo rating properties from FGRL rating lists

Posts: 10650
Joined: Wed Jul 26, 2006 8:21 pm

Re: Wilo rating properties from FGRL rating lists

Dann Corbit wrote:
Hypothetical extreme (to absurd) examples don't help much. That's why I am using empirical data from Andreas' excellent FGRL rating lists with consistent participants and significant number of games, from short to long time controls.
{snip}
I think it is essential that these things are addressed by the model.
For instance, if two opponents (engine A and engine B) hypothetically play a mole of games (6x10^23) and have all draws except 100 and engine A wins all 100, I maintain that the two engines have exactly the same strength for all practical purposes. Despite the win "domination" by A, it is not stronger than B.

On the other hand, a model has exactly the value of its predictive behavior. So if the wins and losses only model can predict better than a model which also includes draws, then the win loss only model is better.

If the model makes absurd predictions (such as "A is stronger than B" after the above experiment) then the model needs a tweak to be able to predict correctly.

IMO-YMMV.
LOS (Likelihood Of Superiority) with a uniform ("uninformed") prior is independent of the number of draws. So +100 =0 -0 has the same LOS as +100 =10^24 -0. What you have is a "feeling" that the engines are equal, which in this case can be described by a non-uniform prior. We rarely use such priors in chess ratings, and they are in any case subject to some sort of "belief".
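
To make the draw-independence concrete, here is a minimal sketch. It assumes the uniform prior is a flat Dirichlet over the (win, draw, loss) probabilities and that "superiority" means the win probability exceeds the loss probability. The function name `los` is just for illustration.

```python
from math import comb


def los(wins: int, draws: int, losses: int) -> float:
    """Likelihood of superiority under a flat prior on (win, draw, loss).

    Under that prior the share of decisive games that are wins has a
    Beta(wins + 1, losses + 1) posterior, so `draws` never enters.
    For integer counts, P(share > 1/2) equals P(Binomial(n, 1/2) <= wins)
    with n = wins + losses + 1.
    """
    n = wins + losses + 1
    favourable = sum(comb(n, k) for k in range(wins + 1))
    return favourable / 2 ** n


# The draw count is accepted but has no effect on the result.
print(los(100, 0, 0) == los(100, 10 ** 24, 0))
```

With equal wins and losses this gives exactly 0.5, and a single win with no losses gives 0.75, whatever the number of draws.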

Dann Corbit
Posts: 11528
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: Wilo rating properties from FGRL rating lists

Dann Corbit wrote:
Hypothetical extreme (to absurd) examples don't help much. That's why I am using empirical data from Andreas' excellent FGRL rating lists with consistent participants and significant number of games, from short to long time controls.
{snip}
I think it is essential that these things are addressed by the model.
For instance, if two opponents (engine A and engine B) hypothetically play a mole of games (6x10^23) and have all draws except 100 and engine A wins all 100, I maintain that the two engines have exactly the same strength for all practical purposes. Despite the win "domination" by A, it is not stronger than B.

On the other hand, a model has exactly the value of its predictive behavior. So if the wins and losses only model can predict better than a model which also includes draws, then the win loss only model is better.

If the model makes absurd predictions (such as "A is stronger than B" after the above experiment) then the model needs a tweak to be able to predict correctly.

IMO-YMMV.
LOS (Likelihood Of Superiority) with a uniform ("uninformed") prior is independent of the number of draws. So +100 =0 -0 has the same LOS as +100 =10^24 -0. What you have is a "feeling" that the engines are equal, which in this case can be described by a non-uniform prior. We rarely use such priors in chess ratings, and they are in any case subject to some sort of "belief".
You will see a win by the "superior" side about once every 6e21 games (100 wins in 6x10^23 games), which is essentially never.

I don't like the model if it says that A is clearly superior to B. That model is wrong.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

Posts: 10650
Joined: Wed Jul 26, 2006 8:21 pm

Re: Wilo rating properties from FGRL rating lists

lkaufman wrote:
That is very interesting, it would indeed be nice if we could use WILO and expect results to be similar at any time control. I suspect that WILO might also make the elo differences less dependent on choice of opening book. With normal elo, books that end in equal positions will show smaller elo differences than books that end with one side half a pawn or so ahead. Maybe this would not be true for WILO.
Just curious, how would the conclusion have changed if you included the tenth engine?
Yes, maybe your guess about opening books is correct. Books that end in very balanced positions show both higher draw rates and smaller Elo differences. Books with somewhat more unbalanced positions, with lower draw rates and larger Elo differences, probably produce more wins and more losses, with a slightly larger effect on wins. This might mean that the wins/losses ratio is fairly stable across both sorts of books.

In the downloaded databases, the tenth engine was Fritz in one list and Chiron in the other, so I couldn't check their Elo/Wilo from 10min to 60min. I may ask Andreas to include both in a list of 11 engines and send me new databases. It requires some work from him, but in the past he was very helpful.
Andreas was very kind to provide me with everything I needed. I now have the same top 11 engines in both lists, spanning time controls from 60s + 0.6s to 3600s + 15s. For this huge span (almost a factor of 60 in time control), I have the following:

Going to LTC:

Elo deflates rating differences by 46.1%.
Normalized Elo (discussed earlier in this and other threads) deflates rating differences by 23.3%.
Wilo inflates rating differences by 8.9%.

All in all, on this x60 time span, 8.9% is not that bad for Wilo.
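
The thread does not say how these percentages were computed. One plausible reading, purely as an assumption, is the least-squares slope of mean-centred LTC ratings against mean-centred STC ratings for the same engines: a slope below 1 means the differences deflate. The function name and the sample ratings below are hypothetical placeholders, not FGRL data.

```python
def spread_change_percent(stc_ratings, ltc_ratings):
    """Percent change in rating spread from STC to LTC (negative = deflation).

    Assumed method: centre both lists on their means, fit LTC = slope * STC
    by least squares through the origin, and report (slope - 1) * 100.
    """
    mean_stc = sum(stc_ratings) / len(stc_ratings)
    mean_ltc = sum(ltc_ratings) / len(ltc_ratings)
    xs = [r - mean_stc for r in stc_ratings]
    ys = [r - mean_ltc for r in ltc_ratings]
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return (slope - 1.0) * 100.0


# Made-up ratings: the LTC differences are exactly half the STC differences.
hypothetical_stc = [0, 100, 200, 300]
hypothetical_ltc = [0, 50, 100, 150]
print(spread_change_percent(hypothetical_stc, hypothetical_ltc))
```

Under this reading, "deflates by 46.1%" would correspond to a slope of about 0.54 and "inflates by 8.9%" to a slope of about 1.09.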

Here is the plot of these top 11 engines in Wilo terms, where their scaling can be seen too:

The best-scaling engines are Booot 6.2 and Andscacs 0.92. The worst-scaling is Fritz 16.

clumma
Posts: 177
Joined: Fri Oct 10, 2014 8:05 pm
Location: Berkeley, CA

Re: Wilo rating properties from FGRL rating lists

Dann Corbit wrote: For instance, if two opponents (engine A and engine B) hypothetically play a mole of games (6x10^23) and have all draws except 100 and engine A wins all 100, I maintain that the two engines have exactly the same strength for all practical purposes.
I do think that's very unlikely, especially if each engine plays black half the time.

However, consider AlphaZero's score against SF: 318/958/24. Scores of 318/479/24 or 318/1916/24 do not seem impossible to me. But the Elo and "playing strength" would be very different in these three cases.
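
A quick way to see this is to compute both ratings for the three scores. This is a sketch under two assumptions: Elo uses the usual logistic formula on the full score with each draw counted as half a point, and Wilo is the same formula applied to wins and losses only. The function names are illustrative.

```python
from math import log10


def elo_diff(wins, draws, losses):
    """Logistic Elo difference, counting each draw as half a point per side."""
    own_points = wins + 0.5 * draws
    opp_points = losses + 0.5 * draws
    return 400 * log10(own_points / opp_points)


def wilo_diff(wins, draws, losses):
    """Assumed Wilo: the same formula with draws discarded.

    Needs wins > 0 and losses > 0.
    """
    return 400 * log10(wins / losses)


# The three win/draw/loss scores from the post.
for score in [(318, 479, 24), (318, 958, 24), (318, 1916, 24)]:
    print(score, round(elo_diff(*score), 1), round(wilo_diff(*score), 1))
```

Under these assumptions Elo comes out at roughly 130, 80 and 45 for the three scores, while Wilo stays at about 449 in all three, since the wins and losses are identical.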

-Carl

clumma
Posts: 177
Joined: Fri Oct 10, 2014 8:05 pm
Location: Berkeley, CA

Re: Wilo rating properties from FGRL rating lists

I would like to see a chart with Elo and Wilo on x and y axes, with engines as points, for one of Andreas' tournaments. Have you made such a thing? It would be best to include all engines, not just the top 11.

-Carl

clumma
Posts: 177
Joined: Fri Oct 10, 2014 8:05 pm
Location: Berkeley, CA

Re: Wilo rating properties from FGRL rating lists

clumma wrote:
Dann Corbit wrote: For instance, if two opponents (engine A and engine B) hypothetically play a mole of games (6x10^23) and have all draws except 100 and engine A wins all 100, I maintain that the two engines have exactly the same strength for all practical purposes.
I do think that's very unlikely, especially if each engine plays black half the time.
Actually maybe not. Maybe A is a perfect engine (32-man tablebase) and B is very very close to perfect, making only one mistake in every 3e21 games. When playing white, this mistake isn't decisive, but when playing black it is, causing it to lose 100 games.

Making one mistake in 3e21 games is a skill for which Wilo gives no credit.
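
For the mole-of-games example the two ratings can be worked out directly. A sketch, assuming Elo is the logistic formula on the full score (draws as half points) and Wilo is the same formula on wins and losses only. The names are illustrative, and `log1p` is used because the edge is far below double-precision resolution around a 50% score.

```python
from math import inf, log, log1p

ELO_PER_NATURAL_LOG = 400 / log(10)  # converts ln(odds) into Elo points

GAMES = 6 * 10 ** 23        # "a mole of games"
WINS, LOSSES = 100, 0       # engine A's decisive results
DRAWS = GAMES - WINS - LOSSES


def elo_diff(wins, draws, losses):
    """Elo difference from the full score, written so a tiny edge survives.

    The odds are 1 + (wins - losses) / (losses + draws / 2); log1p keeps the
    small second term instead of rounding the odds to exactly 1.
    """
    opp_points = losses + draws / 2
    return ELO_PER_NATURAL_LOG * log1p((wins - losses) / opp_points)


def wilo_diff(wins, losses):
    """Assumed Wilo: logistic rating difference with draws discarded."""
    if wins == losses:
        return 0.0
    if losses == 0:
        return inf
    if wins == 0:
        return -inf
    return ELO_PER_NATURAL_LOG * log(wins / losses)


# B is black in half of the games and loses 100 of them.
black_games_per_loss = (GAMES // 2) // WINS

print(elo_diff(WINS, DRAWS, LOSSES))   # on the order of 1e-20 Elo
print(wilo_diff(WINS, LOSSES))         # inf
print(black_games_per_loss)            # 3 * 10**21
```

So under these assumptions Elo puts the two engines within a vanishing fraction of a point of each other, while Wilo rates A infinitely above B.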

-Carl

Posts: 10650
Joined: Wed Jul 26, 2006 8:21 pm

Re: Wilo rating properties from FGRL rating lists

clumma wrote:
"Solving" a non-problem since playing strength difference between engines (and humans) obviously does shrink with LTC.
What is "playing strength" in your understanding?
In another thread on intrinsic ratings, it appeared to me that you and I have different conceptions of Elo. For me it is not a statistical procedure to predict outcome of games. That is how it is defined, but it measures something much more important: playing strength.

This playing strength could be rigorously defined for individual moves. I won't attempt to do so here but consider a hypothetical 32-man tablebase. In any position, it can order legal moves by depth to mate, then inverse of depth to draw, and finally inverse of depth to mate. The "strength" of any of these moves could be defined by this ordering. The "strength" of any list of moves could be some average of the individual move scores.

My assertion is that Elo approximates this score, and that WILO does so comparatively poorly.

-Carl
Well, I can concoct something related on the spot: the percentage of moves that don't change the theoretical outcome of the game, with the theoretical outcome given by 32-man tablebases. We have some numbers on this: in regular 6-man endgames, top engines do the job in at least 95% of games against 6-man tablebases, and these are sequences of, say, 20-30 moves. So about 99.8% of the moves are perfect (again, in regular 6-man endgames, not in some very hard suites). It may well be that top engines play full games of chess with 70-80% perfect moves, and that percentage is some sort of "intrinsic" strength of an engine: say 77% for Stockfish and 76% for Komodo at a certain time control and hardware. But I am unsure how to relate that to anything like Elo or Wilo, and I am not sure it is even necessarily a monotonically increasing function of Elo. A "swindling" engine with a lower perfect-move percentage might have a higher Elo than an engine with a higher perfect-move percentage.
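
The step from "95% of games" to "99.8% of moves" can be checked with a short calculation. This sketch assumes every move in a sequence is independently "perfect" with the same probability, which is a simplification. The function name is illustrative.

```python
def per_move_accuracy(sequence_success_rate, moves_per_sequence):
    """Per-move probability of a perfect move implied by a whole-sequence rate.

    Assumes independent moves with equal accuracy p, so that
    p ** moves_per_sequence == sequence_success_rate.
    """
    return sequence_success_rate ** (1.0 / moves_per_sequence)


# 95% of endgames converted correctly, over sequences of 20 to 30 moves.
for moves in (20, 25, 30):
    print(moves, round(100 * per_move_accuracy(0.95, moves), 2))
```

For sequences of 20 to 30 moves this gives a per-move accuracy between about 99.7% and 99.8%, in line with the figure quoted above.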